| kamila {kamila} | R Documentation | 
KAMILA clustering of mixed-type data.
Description
KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.
Usage
kamila(
  conVar,
  catFactor,
  numClust,
  numInit,
  conWeights = rep(1, ncol(conVar)),
  catWeights = rep(1, ncol(catFactor)),
  maxIter = 25,
  conInitMethod = "runif",
  catBw = 0.025,
  verbose = FALSE,
  calcNumClust = "none",
  numPredStrCvRun = 10,
  predStrThresh = 0.8
)
Arguments
| conVar | A data frame of continuous variables. | 
| catFactor | A data frame of factors. | 
| numClust | The number of clusters returned by the algorithm. | 
| numInit | The number of initializations used. | 
| conWeights | A numeric vector of weights for the continuous variables, one per column of conVar. | 
| catWeights | A numeric vector of weights for the categorical variables, one per column of catFactor. | 
| maxIter | The maximum number of iterations in each run. | 
| conInitMethod | Character: The method used to initialize each run. | 
| catBw | The bandwidth used for the categorical kernel. | 
| verbose | Logical: Whether detailed results should be printed and returned. | 
| calcNumClust | Character: Method for selecting the number of clusters. The default 'none' performs no estimation; 'ps' uses the prediction strength method. | 
| numPredStrCvRun | Numeric: Number of CV runs for the prediction strength method. Ignored unless calcNumClust == 'ps'. | 
| predStrThresh | Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'. | 
Details
KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.
Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for the p continuous variables, and let b_1, ..., b_q denote the weights for the q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership, yielding logLikCat_k for the kth cluster. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and adding logLikCat_k. Note that weights between 0 and 1 are admissible; a weight of zero completely removes a variable's influence on the clustering, and a weight of 1 leaves a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.
Estimating the number of clusters: The default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs (numPredStrCvRun) and the threshold for determining the number of clusters (predStrThresh). The smaller the threshold, the larger the number of clusters selected.
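The weighting scheme described in Details can be illustrated with a short base-R sketch. This is not the package's internal code: the helper names weightedDist, catLogLik, and totalLogLik are hypothetical, and the sketch assumes that each continuous weight multiplies its coordinate difference before squaring and that each categorical weight multiplies the corresponding log-probability. The text does not pin down either detail.

```r
# Illustrative sketch only; helper names are hypothetical, not kamila internals.

# Weighted Euclidean distance between an observation x and a centroid mu.
# Assumption: each weight a_i scales the coordinate difference before squaring.
weightedDist <- function(x, mu, a) {
  sqrt(sum((a * (x - mu))^2))
}

# Weighted categorical log-likelihood for one cluster.
# levProbs[j] is the probability of the observed level of variable j
# given membership in the cluster; b holds the categorical weights.
catLogLik <- function(levProbs, b) {
  sum(b * log(levProbs))
}

# Total log-likelihood for cluster k: the single continuous log-likelihood
# weighted by the mean of the continuous weights, plus the categorical part.
totalLogLik <- function(logLikCon, a, logLikCatK) {
  mean(a) * logLikCon + logLikCatK
}

x  <- c(1, 2)
mu <- c(0, 0)

# Unit weights reproduce the ordinary Euclidean distance.
dUnit <- weightedDist(x, mu, a = c(1, 1))

# A zero weight removes the second continuous variable from the distance.
dDrop <- weightedDist(x, mu, a = c(1, 0))

# A zero categorical weight drops that variable's log-probability term.
llCat <- catLogLik(c(0.5, 0.1), b = c(1, 0))

# Combine a made-up continuous log-likelihood with the categorical part.
llTot <- totalLogLik(logLikCon = -1.2, a = c(1, 1), logLikCatK = llCat)
```

Under these assumptions, setting a weight to zero makes the corresponding term vanish, which matches the statement that zero weights remove a variable's influence.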
Value
A list with the following results objects:
| finalMemb | A numeric vector with cluster assignment indicated by integer. | 
| numIter | |
| finalLogLik | The pseudo log-likelihood of the returned clustering. | 
| finalObj | |
| finalCenters | |
| finalProbs | |
| input | Object with the given input parameter values. | 
| nClust | An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'. | 
| verbose | An optionally returned object with more detailed information. | 
References
Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13
Examples
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
library(kamila)
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
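# Convert each categorical column to a factor, as required by catFactor.
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
# Standardize the continuous variables before clustering.
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
# Fit a two-cluster solution using 10 initializations.
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)
# Cross-tabulate the estimated memberships against the true cluster labels.
table(kamRes$finalMemb, dat$trueID)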
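A further sketch shows how the prediction strength option described in Details might be used. It requires the kamila package and regenerates the toy data so that it runs on its own. Passing a range of candidate values to numClust is an assumption here, as the argument table above does not state it. The structure of the nClust element is inspected with str() rather than assumed.

```r
library(kamila)

# Regenerate the toy data so this block is self-contained.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars))

# Assumption: with calcNumClust = "ps", numClust may hold several candidates.
psRes <- kamila(conDf, catDf, numClust = 2:4, numInit = 10,
  calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.8)

# Inspect the selection results without assuming their internal layout.
str(psRes$nClust)

# Compare the returned memberships with the true cluster labels.
table(psRes$finalMemb, dat$trueID)
```

Lowering predStrThresh should, per the Details section, tend to select a larger number of clusters.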