R: KAMILA clustering of mixed-type data.

kamila {kamila}

R Documentation

KAMILA clustering of mixed-type data.

Description

KAMILA is an iterative clustering method that equitably balances the contribution of continuous and categorical variables.

Usage

kamila(
  conVar,
  catFactor,
  numClust,
  numInit,
  conWeights = rep(1, ncol(conVar)),
  catWeights = rep(1, ncol(catFactor)),
  maxIter = 25,
  conInitMethod = "runif",
  catBw = 0.025,
  verbose = FALSE,
  calcNumClust = "none",
  numPredStrCvRun = 10,
  predStrThresh = 0.8
)

Arguments

`conVar`	A data frame of continuous variables.
`catFactor`	A data frame of factors.
`numClust`	The number of clusters returned by the algorithm.
`numInit`	The number of initializations used.
`conWeights`	A vector of continuous weights for the continuous variables.
`catWeights`	A vector of continuous weights for the categorical variables.
`maxIter`	The maximum number of iterations in each run.
`conInitMethod`	Character: The method used to initialize each run.
`catBw`	The bandwidth used for the categorical kernel.
`verbose`	Logical: Whether detailed results should be printed and returned.
`calcNumClust`	Character: Method for selecting the number of clusters.
`numPredStrCvRun`	Numeric: Number of cross-validation (CV) runs for the prediction strength method. Ignored unless calcNumClust == 'ps'.
`predStrThresh`	Numeric: Threshold for the prediction strength method. Ignored unless calcNumClust == 'ps'.
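
To illustrate how predStrThresh interacts with a set of candidate cluster counts, the base R sketch below applies one common form of the prediction strength rule: pick the largest candidate whose estimated prediction strength is at least the threshold. The values in psValues are invented, selectNumClust is a hypothetical helper, and the rule kamila applies internally may differ (for example, by accounting for standard errors across CV runs).

```r
# Hypothetical prediction strength estimates for candidate cluster counts 2..6.
candidates <- 2:6
psValues <- c(0.95, 0.88, 0.81, 0.62, 0.40)

# Hypothetical helper (not part of kamila): return the largest candidate whose
# prediction strength meets the threshold, or 1 if no candidate does.
selectNumClust <- function(candidates, psValues, thresh = 0.8) {
  ok <- candidates[psValues >= thresh]
  if (length(ok) == 0) 1L else max(ok)
}

selectNumClust(candidates, psValues, thresh = 0.8)  # 4
selectNumClust(candidates, psValues, thresh = 0.6)  # 5
```

Lowering the threshold from 0.8 to 0.6 admits one more candidate here, which matches the general tendency that smaller thresholds select more clusters.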

Details

KAMILA (KAy-means for MIxed LArge data sets) is an iterative clustering method that equitably balances the contribution of the continuous and categorical variables. It uses a kernel density estimation technique to flexibly model spherical clusters in the continuous domain, and uses a multinomial model in the categorical domain.

Weighting scheme: If no weights are desired, set all weights to 1 (the default setting). Let a_1, ..., a_p denote the weights for the p continuous variables, and let b_1, ..., b_q denote the weights for the q categorical variables. Currently, continuous weights are applied during the calculation of Euclidean distance. Categorical weights are applied to the log-likelihoods obtained from the level probabilities given cluster membership, yielding a weighted categorical log-likelihood logLikCat_k for the kth cluster. The total log-likelihood for the kth cluster is obtained by weighting the single continuous log-likelihood by the mean of all continuous weights and adding logLikCat_k. Note that weights between 0 and 1 are admissible; a weight of zero completely removes a variable's influence on the clustering, and a weight of 1 leaves a variable's contribution unchanged. Weights between 0 and 1 may not be comparable across continuous and categorical variables.

Estimating the number of clusters: The default is no estimation method. Setting calcNumClust to 'ps' uses the prediction strength (PS) method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC-based methods for large sample sizes. The user must specify the number of cross-validation runs and the threshold for determining the number of clusters. The smaller the threshold, the larger the number of clusters selected.
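
As a rough illustration of the weighting scheme described in this section, the base R sketch below combines the continuous and categorical terms for one observation and one cluster k. All numeric values are made up, and the names logLikCon_k, logLikCatVars_k, and logLikCatDropped_k are hypothetical. The weighted sum over categorical variables is an assumption about how the categorical weights enter; this mirrors the verbal description only and is not the package's internal code.

```r
# Weights: a_1..a_p for continuous variables, b_1..b_q for categorical ones.
conWeights <- c(1, 0.5)
catWeights <- c(1, 1, 0.25)

# Made-up quantities for one observation and one cluster k.
logLikCon_k <- -1.2                     # single continuous log-likelihood
logLikCatVars_k <- c(-0.7, -1.1, -2.3)  # log level probabilities, one per categorical variable

# Weighted categorical log-likelihood (assumed to be a weighted sum).
logLikCat_k <- sum(catWeights * logLikCatVars_k)

# Total: continuous term scaled by the mean continuous weight, plus the
# categorical term.
logLik_k <- mean(conWeights) * logLikCon_k + logLikCat_k

# A zero weight removes the third categorical variable's contribution entirely.
logLikCatDropped_k <- sum(c(1, 1, 0) * logLikCatVars_k)
```

With all weights equal to 1, mean(conWeights) is 1 and the total reduces to the unweighted sum of the continuous and categorical log-likelihoods.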

Value

A list with the following results objects:

`finalMemb`	A numeric vector of cluster assignments, each indicated by an integer.
`numIter`
`finalLogLik`	The pseudo log-likelihood of the returned clustering.
`finalObj`
`finalCenters`
`finalProbs`
`input`	Object with the given input parameter values.
`nClust`	An object describing the results of selecting the number of clusters, empty if calcNumClust == 'none'.
`verbose`	An optionally returned object with more detailed information.

References

Foss A, Markatou M (2018). kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). doi: 10.18637/jss.v083.i13

Examples

# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
  nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
  conErrLev = 0.3, catErrLev = 0.8)
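# Convert the categorical columns to factors and standardize the continuous
# columns.
catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

# Cluster into two groups using 10 random initializations.
kamRes <- kamila(conDf, catDf, numClust = 2, numInit = 10)

# Cross-tabulate estimated cluster membership against the true group labels.
table(kamRes$finalMemb, dat$trueID)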
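
The prediction strength option described in Details might be exercised along the following lines. This is a sketch under stated assumptions, not a verified recipe: it assumes that numClust accepts a vector of candidate cluster counts when calcNumClust = "ps", and it makes no assumption about the layout of the returned nClust object, which is inspected with str(). The names candidateClust and psRes are illustrative only.

```r
# Candidate numbers of clusters (assumption: a vector is accepted when
# calcNumClust = "ps").
candidateClust <- 2:4

# Guarded so the sketch is skipped if the kamila package is not installed.
if (requireNamespace("kamila", quietly = TRUE)) {
  set.seed(1)
  dat <- kamila::genMixedData(200, nConVar = 2, nCatVar = 2, nCatLevels = 4,
    nConWithErr = 2, nCatWithErr = 2, popProportions = c(.5, .5),
    conErrLev = 0.3, catErrLev = 0.8)
  catDf <- data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE)
  conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

  # Run KAMILA with prediction strength selection over the candidates.
  psRes <- kamila::kamila(conDf, catDf, numClust = candidateClust,
    numInit = 10, calcNumClust = "ps", numPredStrCvRun = 10,
    predStrThresh = 0.8)

  # Inspect the cluster-number selection results without assuming their layout.
  str(psRes$nClust)
}
```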

[Package kamila version 0.1.2 Index]