gmsClust {kamila}

R Documentation

A general implementation of Modha-Spangler clustering for mixed-type data.

Description

Modha-Spangler clustering estimates the optimal weighting for continuous vs categorical variables using a brute-force search strategy.

Usage

gmsClust(
  conData,
  catData,
  nclust,
  searchDensity = 10,
  clustFun = wkmeans,
  conDist = squaredEuc,
  catDist = squaredEuc,
  ...
)

Arguments

`conData`	A data frame of continuous variables.
`catData`	A data frame of categorical variables; the allowable variable types depend on the specific clustering function used.
`nclust`	An integer specifying the number of clusters.
`searchDensity`	An integer determining the number of distinct cluster weightings evaluated in the brute-force search.
`clustFun`	The clustering function to be applied.
`conDist`	The continuous distance function used to construct the objective function.
`catDist`	The categorical distance function used to construct the objective function.
`...`	Arguments to be passed to the `clustFun`.
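
As an illustration of the kind of function that `conDist` and `catDist` expect, the sketch below defines two distance functions that each take two one-row data frames and return a scalar. The names `sqEucRows` and `manhattanRows` are hypothetical and are not part of kamila; the input format follows the description in Details.

```r
# Hypothetical helpers (not part of kamila): each takes two data frame
# rows and returns a single numeric distance.
sqEucRows <- function(rowA, rowB) {
  # Squared Euclidean distance between two rows
  sum((as.numeric(unlist(rowA)) - as.numeric(unlist(rowB)))^2)
}

manhattanRows <- function(rowA, rowB) {
  # Manhattan (city-block) distance between two rows
  sum(abs(as.numeric(unlist(rowA)) - as.numeric(unlist(rowB))))
}

a <- data.frame(x = 1, y = 2)
b <- data.frame(x = 4, y = 6)
sqEucRows(a, b)      # 3^2 + 4^2 = 25
manhattanRows(a, b)  # 3 + 4 = 7
```

A function of this shape could then be supplied as, for example, `conDist = manhattanRows`.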

Details

Modha-Spangler clustering uses a brute-force search strategy to estimate the optimal weighting for continuous vs categorical variables. This implementation admits an arbitrary clustering function and arbitrary objective functions for continuous and categorical variables.
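
The brute-force search can be sketched in base R as follows. This is a simplified illustration, not the code used by `gmsClust`: it assumes an evenly spaced interior grid of continuous weights, uses `stats::kmeans` on rescaled data as the base clustering method, and assumes the objective is the product of the within-cluster to between-cluster distortion ratios of the two variable types. The helper `distortionRatio` and the toy data are hypothetical.

```r
set.seed(42)
n <- 100
group <- rep(1:2, each = n / 2)

# Toy data: continuous variables separate the groups well; the dummy-coded
# categorical variable agrees with the group only about 60% of the time.
conData <- data.frame(x1 = rnorm(n, mean = 3 * group),
                      x2 = rnorm(n, mean = 3 * group))
catLevel <- ifelse(runif(n) < 0.6, group, 3 - group)
catData <- data.frame(a = as.numeric(catLevel == 1),
                      b = as.numeric(catLevel == 2))

# Hypothetical helper: within-cluster divided by between-cluster squared
# Euclidean distortion for one set of variables.
distortionRatio <- function(dat, cluster) {
  dat <- as.matrix(dat)
  grand <- colMeans(dat)
  within <- 0
  between <- 0
  for (k in unique(cluster)) {
    sub <- dat[cluster == k, , drop = FALSE]
    center <- colMeans(sub)
    within <- within + sum(sweep(sub, 2, center)^2)
    between <- between + nrow(sub) * sum((center - grand)^2)
  }
  within / between
}

searchDensity <- 10
# Assumed grid: searchDensity weights strictly between 0 and 1
weights <- seq_len(searchDensity) / (searchDensity + 1)
objFun <- numeric(searchDensity)

for (i in seq_along(weights)) {
  w <- weights[i]
  # Scaling by square roots makes squared Euclidean distance weighted
  # by w (continuous) and 1 - w (categorical).
  combined <- cbind(sqrt(w) * as.matrix(conData),
                    sqrt(1 - w) * as.matrix(catData))
  fit <- kmeans(combined, centers = 2, nstart = 5)
  objFun[i] <- distortionRatio(conData, fit$cluster) *
    distortionRatio(catData, fit$cluster)
}

bestInd <- which.min(objFun)  # index of the weighting with the lowest objective
weights[bestInd]
```

Evaluating the objective on the unweighted data keeps the values comparable across weightings, since the weights only influence which partition the clustering step returns.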

The input parameter clustFun must be a function accepting inputs (conData, catData, conWeight, nclust, ...) and returning a list containing (at least) the elements cluster, conCenters, and catCenters. The list element "cluster" contains cluster memberships denoted by the integers 1:nclust. The list elements "conCenters" and "catCenters" must be data frames whose rows denote cluster centroids. The function clustFun must allow nclust = 1, in which case conCenters and catCenters must each be a data frame with a single row. Input parameters conDist and catDist are functions that must each take two data frame rows as input and return a scalar distance measure.
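
A minimal sketch of a custom function with the interface required of clustFun is given below. The name `myClustFun` is hypothetical; the sketch wraps `stats::kmeans`, assumes that `catData` is already numeric (for example, dummy coded), and assumes that `conWeight` lies between 0 and 1 with the categorical variables weighted by `1 - conWeight`.

```r
# Hypothetical clustering function (not part of kamila) that follows the
# required interface: inputs (conData, catData, conWeight, nclust, ...),
# output a list with cluster, conCenters, and catCenters.
myClustFun <- function(conData, catData, conWeight, nclust, ...) {
  conMat <- as.matrix(conData)
  catMat <- as.matrix(catData)
  # Rescale so that squared Euclidean distance is weighted by conWeight
  # (continuous) and 1 - conWeight (categorical).
  combined <- cbind(sqrt(conWeight) * conMat,
                    sqrt(1 - conWeight) * catMat)
  fit <- stats::kmeans(combined, centers = nclust, ...)
  # Centroids on the original (unweighted) scale: cluster-wise means
  sizes <- as.vector(table(fit$cluster))
  list(
    cluster    = fit$cluster,
    conCenters = as.data.frame(rowsum(conMat, fit$cluster) / sizes),
    catCenters = as.data.frame(rowsum(catMat, fit$cluster) / sizes)
  )
}

set.seed(1)
con <- data.frame(x = c(rnorm(20), rnorm(20, mean = 5)))
cat <- data.frame(a = rep(c(1, 0), each = 20), b = rep(c(0, 1), each = 20))

res2 <- myClustFun(con, cat, conWeight = 0.5, nclust = 2)
res1 <- myClustFun(con, cat, conWeight = 0.5, nclust = 1)  # single-cluster case
nrow(res1$conCenters)  # one centroid row when nclust = 1
```

A function of this form could be passed as `clustFun = myClustFun`, with further arguments to `kmeans` (such as `nstart`) forwarded through `...`.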

Value

A list containing the following results objects:

`results`	A results object corresponding to the base clustering algorithm
`objFun`	A numeric vector of length `searchDensity` containing the values of the objective function for each weight used
`Qcon`	A numeric vector of length `searchDensity` containing the values of the continuous component of the objective function
`Qcat`	A numeric vector of length `searchDensity` containing the values of the categorical component of the objective function
`bestInd`	The index of the most successful run
`weights`	A numeric vector of length `searchDensity` containing the continuous weights used

References

Foss A, Markatou M; kamila: Clustering Mixed-Type Data in R and Hadoop. Journal of Statistical Software, 83(13). 2018. doi: 10.18637/jss.v083.i13

Modha DS, Spangler WS; Feature Weighting in k-Means Clustering. Machine Learning, 52(3). 2003. doi: 10.1023/a:1024016609528

Examples

## Not run: 
# Generate toy data set with poor quality categorical variables and good
# quality continuous variables.
library(kamila)
set.seed(1)
dat <- genMixedData(200, nConVar=2, nCatVar=2, nCatLevels=4, nConWithErr=2,
  nCatWithErr=2, popProportions=c(.5,.5), conErrLev=0.3, catErrLev=0.8)
catDf <- dummyCodeFactorDf(data.frame(apply(dat$catVars, 2, factor), stringsAsFactors = TRUE))
conDf <- data.frame(scale(dat$conVars), stringsAsFactors = TRUE)

msRes <- gmsClust(conDf, catDf, nclust=2)

table(msRes$results$cluster, dat$trueID)

## End(Not run)

[Package kamila version 0.1.2 Index]