kamila {kamila} | R Documentation |
KAMILA clustering of mixed-type data.
Description
KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.
Usage
kamila(
  conVar,
  catFactor,
  numClust,
  numInit,
  conWeights = rep(1, ncol(conVar)),
  catWeights = rep(1, ncol(catFactor)),
  maxIter = 25,
  conInitMethod = "runif",
  catBw = 0.025,
  verbose = FALSE,
  calcNumClust = "none",
  numPredStrCvRun = 10,
  predStrThresh = 0.8
)
Arguments
conVar |
A data frame of continuous variables. |
catFactor |
A data frame of factors. |
numClust |
The number of clusters returned by the algorithm. |
numInit |
The number of initializations used. |
conWeights |
A vector of continuous weights for the continuous variables. |
catWeights |
A vector of continuous weights for the categorical variables. |
maxIter |
The maximum number of iterations in each run. |
conInitMethod |
Character: The method used to initialize each run. |
catBw |
The bandwidth used for the categorical kernel. |
verbose |
Logical: Whether detailed results should be printed and returned. |
calcNumClust |
Character: Method for selecting the number of clusters. |
numPredStrCvRun |
Numeric: Number of cross-validation runs for the prediction strength method. Ignored unless calcNumClust == 'ps'. |
predStrThresh |
Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'. |
Details
KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.
Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for the p continuous variables, and let b_1, ..., b_q denote the weights for the q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership; the resulting weighted categorical log-likelihood for the kth cluster is denoted logLikCat_k. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and then adding logLikCat_k.

Note that weights between 0 and 1 are admissible: a weight equal to zero completely removes a variable's influence on the clustering, and a weight equal to 1 leaves a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.

Estimating the number of clusters: The default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; for large sample sizes, prediction strength tends to select fewer clusters than, say, BIC-based methods. The user must specify the number of cross-validation runs and the threshold for determining the number of clusters. The smaller the threshold, the larger the number of clusters selected.
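The weighting scheme can be illustrated with a small base R sketch. It is not the package's internal code: the helper names weightedDist, weightedCatLogLik and totalLogLik are hypothetical, and the exact form of the weighted distance (each weight a_i scaling its coordinate difference before squaring) is an assumption, since the text only says that the weights enter the Euclidean distance calculation.

```r
# Hypothetical helpers illustrating the weighting scheme; not kamila internals.

# Assumption: each continuous weight a_i scales the i-th coordinate
# difference before it is squared.
weightedDist <- function(x, center, a) {
  sqrt(sum((a * (x - center))^2))
}

# Categorical part for one cluster: each variable's log level probability
# (given cluster membership) is multiplied by its weight b_j, then summed.
weightedCatLogLik <- function(levelProbs, b) {
  sum(b * log(levelProbs))
}

# Total for cluster k: the single continuous log-likelihood weighted by the
# mean of the continuous weights, plus the categorical part.
totalLogLik <- function(logLikCon, logLikCat, a) {
  mean(a) * logLikCon + logLikCat
}

a <- c(1, 0)        # a weight of zero switches the second variable off
b <- c(1, 0.5)      # the second categorical variable counts half
x <- c(0.2, 5.0)
center <- c(0, 0)

d <- weightedDist(x, center, a)               # only coordinate 1 contributes
llCat <- weightedCatLogLik(c(0.7, 0.1), b)    # made-up level probabilities
ll <- totalLogLik(logLikCon = -1.3, logLikCat = llCat, a = a)
```

In this sketch a zero weight removes the second continuous variable from the distance entirely, which matches the statement that zero weights remove a variable's influence.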
Value
A list with the following result objects:
finalMemb |
A numeric vector of cluster assignments, coded as integers. |
numIter |
|
finalLogLik |
The pseudo log-likelihood of the returned clustering. |
finalObj |
|
finalCenters |
|
finalProbs |
|
input |
Object with the given input parameter values. |
nClust |
An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'. |
verbose |
An optionally returned object with more detailed information. |
References
Foss A, Markatou M (2018). kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). doi:10.18637/jss.v083.i13
Examples
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)
table(kamRes$finalMemb, dat$trueID)
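The prediction strength option described in Details can be tried on the same toy data. This is a sketch under stated assumptions: it assumes numClust accepts a vector of candidate cluster counts when calcNumClust = "ps", and the object name kamPs is illustrative. The layout of the returned nClust element is inspected with str() rather than assumed.

```r
library(kamila)

# Same toy data as in the example above.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

# Assumption: candidate cluster counts are supplied as a vector via numClust.
# numPredStrCvRun and predStrThresh are only used because calcNumClust = "ps".
kamPs <- kamila(conDf, catDf, numClust = 2:4, numInit = 10,
  calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.8)

# Inspect the cluster-number selection results without assuming field names.
str(kamPs$nClust)

# Compare the returned clustering with the true group labels.
table(kamPs$finalMemb, dat$trueID)
```

Lowering predStrThresh should, per the Details section, tend to select a larger number of clusters.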