gmsClust {kamila} | R Documentation |
A general implementation of Modha-Spangler clustering for mixed-type data.
Description
Modha-Spangler clustering estimates the optimal weighting for continuous vs categorical variables using a brute-force search strategy.
Usage
gmsClust(
conData,
catData,
nclust,
searchDensity = 10,
clustFun = wkmeans,
conDist = squaredEuc,
catDist = squaredEuc,
...
)
Arguments
conData |
A data frame of continuous variables. |
catData |
A data frame of categorical variables; the allowable variable types depend on the specific clustering function used. |
nclust |
An integer specifying the number of clusters. |
searchDensity |
An integer determining the number of distinct cluster weightings evaluated in the brute-force search. |
clustFun |
The clustering function to be applied. |
conDist |
The continuous distance function used to construct the objective function. |
catDist |
The categorical distance function used to construct the objective function. |
... |
Arguments to be passed to the |
Details
Modha-Spangler clustering uses a brute-force search strategy to estimate the optimal weighting for continuous vs categorical variables. This implementation admits an arbitrary clustering function and arbitrary objective functions for continuous and categorical variables.
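The brute-force search described above can be illustrated with a small, self-contained sketch in base R. It is not the gmsClust source. The weight grid, the helper distortionRatio, and the objective (the product of within-to-between distortion ratios for the continuous and categorical parts, following the general idea in Modha and Spangler, 2003) are assumptions made for illustration. Weighted k-means is emulated here by rescaling columns before calling stats::kmeans.

```r
# Hypothetical helper (not part of kamila): ratio of within-cluster to
# between-cluster squared Euclidean distortion for a numeric matrix x
# and a vector of cluster memberships cl.
distortionRatio <- function(x, cl) {
  x <- as.matrix(x)
  grand <- colMeans(x)
  within <- 0
  between <- 0
  for (k in unique(cl)) {
    xk <- x[cl == k, , drop = FALSE]
    ck <- colMeans(xk)
    within <- within + sum(sweep(xk, 2, ck)^2)
    between <- between + nrow(xk) * sum((ck - grand)^2)
  }
  within / between
}

# Toy data: two informative continuous variables and two uninformative
# 0/1 dummy variables standing in for categorical data.
set.seed(1)
n <- 100
grp <- rep(1:2, each = n / 2)
con <- cbind(rnorm(n, mean = 3 * grp), rnorm(n, mean = 3 * grp))
cat01 <- cbind(rbinom(n, 1, 0.5), rbinom(n, 1, 0.5))

# Assumed grid: searchDensity interior points of (0, 1).
searchDensity <- 10
weights <- seq(0, 1, length.out = searchDensity + 2)[-c(1, searchDensity + 2)]

# One clustering run per candidate weight. Scaling columns by sqrt(w)
# weights their squared Euclidean distances by w.
objFun <- sapply(weights, function(w) {
  fit <- kmeans(cbind(sqrt(w) * con, sqrt(1 - w) * cat01),
                centers = 2, nstart = 5)
  distortionRatio(con, fit$cluster) * distortionRatio(cat01, fit$cluster)
})

# The selected weighting minimizes the combined objective.
bestInd <- which.min(objFun)
weights[bestInd]
```

Because every candidate weight requires a full clustering run, the cost of the search grows linearly with searchDensity.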
The input parameter clustFun must be a function accepting inputs (conData, catData, conWeight, nclust, ...) and returning a list containing (at least) the elements cluster, conCenters, and catCenters. The list element "cluster" contains cluster memberships denoted by the integers 1:nclust. The list elements "conCenters" and "catCenters" must be data frames whose rows denote cluster centroids. The function clustFun must allow nclust = 1, in which case "conCenters" and "catCenters" are each a data frame with a single row. Input parameters conDist and catDist are functions that must each take two data frame rows as input and return a scalar distance measure.
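As a rough illustration of the interface requirements above, the following sketch defines a distance function and a clustering function with the required signatures. The names manhattanDist, centersOf, and myClustFun are hypothetical and not part of kamila. The sketch assumes catData is already dummy coded (numeric) and that the categorical weight is 1 - conWeight; neither assumption is stated on this page.

```r
# Hypothetical distance function: takes two single-row data frames and
# returns a scalar (Manhattan distance).
manhattanDist <- function(row1, row2) {
  sum(abs(as.numeric(row1) - as.numeric(row2)))
}

# Hypothetical helper: per-cluster column means as a data frame with one
# row per cluster, in the order 1:k.
centersOf <- function(df, cl, k) {
  as.data.frame(do.call(rbind, lapply(seq_len(k), function(j) {
    colMeans(df[cl == j, , drop = FALSE])
  })))
}

# Hypothetical clustering function with the signature required of clustFun.
# Assumes numeric (dummy-coded) catData and a categorical weight of
# 1 - conWeight.
myClustFun <- function(conData, catData, conWeight, nclust, ...) {
  x <- cbind(sqrt(conWeight) * as.matrix(conData),
             sqrt(1 - conWeight) * as.matrix(catData))
  fit <- kmeans(x, centers = nclust, ...)
  list(
    cluster = fit$cluster,                                # integers in 1:nclust
    conCenters = centersOf(conData, fit$cluster, nclust), # unweighted scale
    catCenters = centersOf(catData, fit$cluster, nclust)
  )
}

# Small check on toy data, including the nclust = 1 case.
conDf <- data.frame(a = c(0, 0.1, 5, 5.1), b = c(0, 0.2, 5, 5.2))
catDf <- data.frame(x = c(1, 1, 0, 0), y = c(0, 0, 1, 1))
res2 <- myClustFun(conDf, catDf, conWeight = 0.5, nclust = 2)
res1 <- myClustFun(conDf, catDf, conWeight = 0.5, nclust = 1)
manhattanDist(conDf[1, ], conDf[3, ])
```

Functions of this shape would then be supplied through the documented arguments, for example gmsClust(conDf, catDf, nclust = 2, clustFun = myClustFun, conDist = manhattanDist, catDist = manhattanDist). Whether this particular combination yields a sensible objective function has not been verified here.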
Value
A list containing the following results objects:
results |
A results object corresponding to the base clustering algorithm |
objFun |
A numeric vector of length |
Qcon |
A numeric vector of length |
Qcat |
A numeric vector of length |
bestInd |
The index of the most successful run |
weights |
A numeric vector of length |
References
Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13
Modha DS, Spangler WS; Feature Weighting in k-Means Clustering. Machine Learning, 52(3). 2003. doi: 10.1023/a:1024016609528
Examples
## Not run:
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
set.seed(1)
dat <- genMixedData(200, nConVar=2, nCatVar=2, nCatLevels=4, nConWithErr=2,
nCatWithErr=2, popProportions=c(.5,.5), conErrLev=0.3, catErrLev=0.8)
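# Dummy code the categorical variables and standardize the continuous ones.
catDf <- dummyCodeFactorDf(data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE))
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)
# Run the weight search with the default clustering and distance functions.
msRes <- gmsClust(conDf, catDf, nclust=2)
# Cross-tabulate estimated cluster memberships against the true groups.
table(msRes$results$cluster, dat$trueID)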
## End(Not run)