clusterforest {C443}        R Documentation
Clustering the classification trees in a forest based on similarities
Description
A function to gain insight into a forest of classification trees by clustering the trees in the forest using Partitioning Around Medoids (PAM; Kaufman & Rousseeuw, 2009), based on user-provided similarities or on similarities calculated by the package using a similarity measure chosen by the user (see Sies & Van Mechelen, 2020).
Usage
clusterforest(
observeddata,
treedata = NULL,
trees,
simmatrix = NULL,
m = NULL,
tol = NULL,
weight = NULL,
fromclus = 1,
toclus = 1,
treecov = NULL,
sameobs = FALSE,
seed = NULL,
no_cores = detectCores(logical = FALSE)
)
Arguments
observeddata |
The entire observed dataset |
treedata |
A list of dataframes on which the trees are based. Not necessary if the data set is included in the tree object already. |
trees |
A list of trees of class party, classes inheriting from party (e.g., glmtree), classes that can be coerced to party (i.e., rpart, Weka_tree, XMLnode), or a randomForest or ranger object. |
simmatrix |
A similarity matrix with the similarities between all pairs of trees. Should be square, symmetric, and have ones on the diagonal (a short sketch of supplying such a matrix follows the argument list below). Default=NULL |
m |
Similarity measure that should be used to calculate similarities, in the case that no similarity matrix was provided by the user. Default=NULL. m=1 is based on counting common predictors; m=2 is based on counting common predictor-split point combinations; m=3 is based on common ordered sets of predictor-range part combinations (see Shannon & Banks, 1999); m=4 is based on the agreement of partitions implied by leaf membership (Chipman et al., 1998); m=5 is based on the agreement of partitions implied by class labels (Chipman et al., 1998); m=6 is based on the number of predictor occurrences in definitions of leaves with the same class label; m=7 is based on the number of predictor-split point combinations in definitions of leaves with the same class label; m=8 measures closeness to logical equivalence (applicable in the case of binary predictors only) |
tol |
A vector with, for each predictor, a number that defines the tolerance zone within which two split points of the predictor in question are assumed equal. For example, if the tolerance for predictor X is 1, then a split on that predictor in tree A will be assumed equal to a split in tree B as long as the split point in tree B lies within the split point in tree A plus or minus 1. Only applicable for the measures that use split points (m=2 and m=7). Default=NULL |
weight |
If 1, the number of dissimilar paths in the Shannon and Banks measure (m=3) should be weighted by 1/their length (otherwise they are weighted equally). Only applicable for m=3. Default=NULL |
fromclus |
The lowest number of clusters for which the PAM algorithm should be run. Default=1. |
toclus |
The highest number of clusters for which the PAM algorithm should be run. Default=1. |
treecov |
A vector or dataframe with the covariate value(s) for each tree in the forest (one column per covariate), in the case of known sources of variation underlying the forest that should be linked to the clustering solution. |
sameobs |
Are the same observations included in every tree data set? For example, in the case of subsamples or bootstrap samples, the answer is no. Default=FALSE |
seed |
A seed number that should be used for the multi-start procedure (based on which initial medoids are assigned). Default=NULL. |
no_cores |
Number of CPU cores used for computations. Default=detectCores(logical=FALSE) |
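As an illustration of the simmatrix argument, the sketch below passes a user-computed similarity matrix for a forest of three trees instead of selecting a measure via m. The names mydata, mytreedata and mytrees are placeholders for the inputs described above, and the matrix values are invented for illustration only:

Sim <- matrix(c(1.0, 0.8, 0.3,
                0.8, 1.0, 0.5,
                0.3, 0.5, 1.0), nrow = 3, byrow = TRUE) #square, symmetric, ones on the diagonal
cf <- clusterforest(observeddata = mydata, treedata = mytreedata, trees = mytrees,
                    simmatrix = Sim, fromclus = 1, toclus = 2)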
Details
The user should provide the number of clusters that the solution should contain, or a range of numbers that should be explored. In the latter case, the resulting clusterforest object will contain clustering results for each solution. On this clusterforest object, several methods, such as plot, print and summary, can be used.
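For example, a whole range of numbers of clusters can be explored in a single call and the resulting solutions compared afterwards (a minimal sketch; mydata, mytreedata and mytrees are placeholders for the inputs described under Arguments):

cf <- clusterforest(observeddata = mydata, treedata = mytreedata, trees = mytrees,
                    m = 1, fromclus = 1, toclus = 5)
print(cf)   #clustering results for 1 up to 5 clusters
plot(cf)    #e.g., to compare the solutions when choosing a number of clusters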
Value
The function returns an object of class clusterforest, with attributes:
medoids |
the position of the medoid trees in the forest (i.e., which elements of the list of party trees) |
medoidtrees |
the medoid trees |
clusters |
The cluster to which each tree in the forest is assigned |
avgsilwidth |
The average silhouette width for each solution (see Kaufman and Rousseeuw, 2009) |
accuracy |
For each solution, the accuracy of the predicted class labels based on the medoids. |
agreement |
For each solution, the agreement between the predicted class label for each observation based on the forest as a whole and the one based on the medoids only (see Sies & Van Mechelen, 2020) |
withinsim |
Within cluster similarity for each solution (see Sies & Van Mechelen, 2020) |
treesimilarities |
Similarity matrix on which clustering was based |
treecov |
covariate value(s) for each tree in the forest |
seed |
seed number that was used for the multi-start procedure (based on which initial medoids were assigned) |
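Assuming these components can be retrieved from the returned object with the usual $ operator (an assumption made for illustration; cf denotes a clusterforest object as in the sketch under Details):

cf$medoids       #position of the medoid tree(s) for each solution
cf$medoidtrees   #the medoid trees themselves
cf$clusters      #cluster assignment of every tree for each solution
cf$avgsilwidth   #average silhouette width, e.g., to choose between solutions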
References
Kaufman, L., & Rousseeuw, P. J. (2009). Finding groups in data: an introduction to cluster analysis (Vol. 344). John Wiley & Sons.
Sies, A., & Van Mechelen, I. (2020). C443: An R-package to see a forest for the trees. Journal of Classification.
Shannon, W. D., & Banks, D. (1999). Combining classification trees using MLE. Statistics in medicine, 18(6), 727-740.
Chipman, H. A., George, E. I., & McCulloch, R. E. (1998). Making sense of a forest of trees. Computing Science and Statistics, 84-92.
Examples
require(MASS)
require(ranger)
require(rpart)
#Function to draw a bootstrap sample from a dataset
DrawBoots <- function(dataset, i){
set.seed(2394 + i)
Boot <- dataset[sample(1:nrow(dataset), size = nrow(dataset), replace = TRUE),]
return(Boot)
}
#Function to grow a tree using rpart on a dataset
GrowTree <- function(x, y, BootsSample, minsplit = 40, minbucket = 20, maxdepth = 3){
controlrpart <- rpart.control(minsplit = minsplit, minbucket = minbucket, maxdepth = maxdepth,
maxsurrogate = 0, maxcompete = 0)
tree <- rpart(as.formula(paste(y, "~", paste(x, collapse = " + "))),
data = BootsSample, control = controlrpart)
return(tree)
}
#Use functions to draw 10 bootstrap samples and grow a tree on each sample
Boots <- lapply(1:10, function(k) DrawBoots(Pima.tr, k))
Trees <- lapply(1:10, function (i) GrowTree(x=c("npreg", "glu", "bp", "skin",
"bmi", "ped", "age"), y="type", Boots[[i]] ))
#Clustering the trees in this forest
ClusterForest <- clusterforest(observeddata=Pima.tr, treedata=Boots, trees=Trees, m=1,
fromclus=1, toclus=2, sameobs=FALSE, no_cores=2)
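#Inspect the solutions with the methods mentioned under Details
#(a sketch; the exact arguments of these methods may differ)
summary(ClusterForest)
plot(ClusterForest)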
#Example with a random forest grown using ranger
Pima.tr.ranger <- ranger(type ~ ., data = Pima.tr, keep.inbag = TRUE, num.trees=20,
max.depth=3)
ClusterForest <- clusterforest(observeddata=Pima.tr, trees=Pima.tr.ranger, m=5,
fromclus=1, toclus=2, sameobs=FALSE, no_cores=2)