clusterforest {C443}R Documentation

Clustering the classification trees in a forest based on similarities

Description

A function to get insight into a forest of classification trees by clustering the trees in a forest using Partitioning Around Medoids (PAM, Kaufman & Rousseeuw, 2009), based on user provided similarities, or based on similarities calculated by the package using a similarity measure chosen by the user (see Sies & Van Mechelen, 2020).

Usage

clusterforest(
  observeddata,
  treedata,
  trees,
  simmatrix = NULL,
  m = NULL,
  tol = NULL,
  weight = NULL,
  fromclus = 1,
  toclus = 1,
  treecov = NULL,
  sameobs = FALSE,
  seed = NULL
)

Arguments

observeddata

The entire observed dataset

treedata

A list of dataframes on which the trees are based

trees

A list of trees of class party, classes inheriting from party (e.g., glmtree), or classes that can be coerced to party (i.e., rpart, Weka_tree, XMLnode).

simmatrix

A similaritymatrix with the similarities between all trees. Should be square, symmetric and have ones on the diagonal. Default=NULL

m

Similarity measure that should be used to calculate similarities, in the case that no similarity matrix was provided by the user. Default=NULL. m=1 is based on counting common predictors; m=2 is based on counting common predictor-split point combinations; m=3 is based on common ordered sets of predictor-range part combinations (see Shannon & Banks (1999)); m=4 is based on the agreement of partitions implied by leaf membership (Chipman, 1998); m=5 is based on the agreement of partitions implied by class labels (Chipman, 1998); m=6 is based on the number of predictor occurrences in definitions of leaves with same class label; m=7 is based on the number of predictor-split point combinations in definitions of leaves with same class label m=8 measures closeness to logical equivalence (applicable in case of binary predictors only)

tol

A vector with for each predictor a number that defines the tolerance zone within which two split points of the predictor in question are assumed equal. For example, if the tolerance for predictor X is 1, then a split on that predictor in tree A will be assumed equal to a split in tree B as long as the splitpoint in tree B is within the splitpoint in tree A + or - 1. Only applicable for m=1 and m=6. Default=NULL

weight

If 1, the number of dissimilar paths in the Shannon and Banks measure (m=2), should be weighted by 1/their length (Otherwise they are weighted equally). Only applicable for m=2. Default=NULL

fromclus

The lowest number of clusters for which the PAM algorithm should be run. Default=1.

toclus

The highest number of clusters for which the PAM algorithm should be run. Default=1.

treecov

A vector/dataframe with the covariate value(s) for each tree in the forest (1 column per covariate).

sameobs

Are the same observations included in every tree data set? For example, in the case of subsamples or bootstrap samples, the answer is no. Default=FALSE

seed

A seed number that should be used for the multi start procedure (based on which initial medoids are assigned). Default=NULL.

Details

The user should provide the number of clusters that the solution should contain, or a range of numbers that should be explored. In the latter case, the resulting clusterforest object will contain clustering results for each solution. On this clusterforest object, several methods, such as plot, print and summary, can be used.

Value

The function returns an object of class clusterforest, with attributes:

medoids

the position of the medoid trees in the forest (i.e., which element of the list of partytrees)

medoidtrees

the medoid trees

clusters

The cluster to which each tree in the forest is assigned

avgsilwidth

The average silhouette width for each solution (see Kaufman and Rousseeuw, 2009)

accuracy

For each solution, the accuracy of the predicted class labels based on the medoids.

agreement

For each solution, the agreement between the predicted class label for each observation based on the forest as a whole, and those based on the medoids only (see Sies & Van Mechelen,2020)

withinsim

Within cluster similarity for each solution (see Sies & Van Mechelen, 2020)

treesimilarities

Similarity matrix on which clustering was based

treecov

covariate value(s) for each tree in the forest

seed

seed number that was used for the multi start procedure (based on which initial medoids were assigned)

References

Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.

Sies, A. & Van Mechelen I. (2020). C443: An R-package to see a forest for the trees. Journal of Classification.

Shannon, W. D., & Banks, D. (1999). Combining classification trees using MLE. Statistics in medicine, 18(6), 727-740.

Chipman, H. A., George, E. I., & McCulloh, R. E. (1998). Making sense of a forest of trees. Computing Science and Statistics, 84-92.

Examples

require(MASS)
require(rpart)
#Function to draw a bootstrap sample from a dataset
DrawBoots <- function(dataset, i){
set.seed(2394 + i)
Boot <- dataset[sample(1:nrow(dataset), size = nrow(dataset), replace = TRUE),]
return(Boot)
}

#Function to grow a tree using rpart on a dataset
GrowTree <- function(x,y,BootsSample, minsplit = 40, minbucket = 20, maxdepth =3){
 controlrpart <- rpart.control(minsplit = minsplit, minbucket = minbucket, maxdepth = maxdepth,
  maxsurrogate = 0, maxcompete = 0)
 tree <- rpart(as.formula(paste(noquote(paste(y, "~")), noquote(paste(x, collapse="+")))),
 data = BootsSample, control = controlrpart)
 return(tree)
}

#Use functions to draw 20 boostrapsamples and grow a tree on each sample
Boots<- lapply(1:10, function(k) DrawBoots(Pima.tr ,k))
Trees <- lapply(1:10, function (i) GrowTree(x=c("npreg", "glu",  "bp",  "skin",
"bmi", "ped", "age"), y="type", Boots[[i]] ))

#Clustering the trees in this forest
ClusterForest<- clusterforest(observeddata=Pima.tr,treedata=Boots,trees=Trees,m=1,
fromclus=1, toclus=5, sameobs=FALSE)

[Package C443 version 3.2.2 Index]