wKModes {MEDseq} | R Documentation |
Weighted K-Modes Clustering with Tie-Breaking
Description
Perform k-modes clustering on categorical data with observation-specific sampling weights and tie-breaking adjustments.
Usage
wKModes(data,
modes,
weights = NULL,
iter.max = .Machine$integer.max,
freq.weighted = FALSE,
fast = TRUE,
random = TRUE,
...)
Arguments
data |
A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
modes |
Either the number of modes or a set of initial (distinct) cluster modes (where each mode is a row and |
weights |
Optional numeric vector containing non-negative observation-specific case weights. |
iter.max |
The maximum number of iterations allowed. Defaults to |
freq.weighted |
A logical indicating whether the usual simple-matching (Hamming) distance between objects is used, or a frequency weighted version of this distance. Defaults to |
fast |
A logical indicating whether a fast version of the algorithm should be applied. Defaults to |
random |
A logical indicating whether ties for the modal values &/or assignments are broken at random. Defaults to Regarding the modes, ties are broken at random when |
... |
Catches unused arguments. |
Details
The k-modes algorithm (Huang, 1998) is an extension of the k-means algorithm by MacQueen (1967).
The data given by data
is clustered by the k-modes method (Huang, 1998) which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimised.
By default, the simple-matching (Hamming) distance is used to determine the dissimilarity of two objects. It is computed by counting the number of mismatches in all variables. Alternatively, this distance can be weighted by the frequencies of the categories in data, using the freq.weighted
argument (see Huang, 1998, for details). For convenience, the function dist_freqwH
is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
If an initial matrix of modes is supplied, it is possible that no object will be closest to one or more modes. In this case, fewer clusters than the number of supplied modes will be returned and a warning will be printed.
If called using fast = TRUE
, the reassignment of the data to clusters is done for the entire data set before recomputation of the modes is done. For computational reasons, this option should be chosen for all but the most moderate of data sizes.
Value
An object of class "wKModes"
which is a list with the following components:
cluster
A vector of integers indicating the cluster to which each object is allocated.
size
The number of objects in each cluster.
modes
A matrix of cluster modes.
withindiff
The within-cluster (weighted) simple-matching distance for each cluster.
tot.withindiff
The total within-cluster (weighted) distance over all clusters.
tot.withindiff
can be used to guide the choice of the number of clusters, but beware of inherent randomness in the algorithm, which is liable to yield a jagged elbow plot (see examples).iterations
The number of iterations the algorithm reached.
weighted
A logical indicating whether observation-level
weights
were used or not throughout the algorithm.freq.weighted
A logical indicating whether feature-level
freq.weights
were used or not in the computation of the distances. For convenience, the functiondist_freqwH
is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.random
A logical indicating whether ties were broken at random or not throughout the algorithm.
Note
This code is adapted from the kmodes
function in the klaR package. Specifically, modifications were made to allow for random tie-breaking for the modes and assignments (see random
above) and the incorporation of observation-specific sampling weights
, with a view to using this function as a means to initialise the allocations for MEDseq models (see the MEDseq_control
argument init.z
and the related options "kmodes"
and "kmodes2"
).
Notably, the wKModes
function, when invoked inside MEDseq_fit
, is used regardless of whether the weights are true sampling weights, or the weights are merely aggregation weights, or there are no weights at all. Furthermore, the MEDseq_control
argument random
is also passed to wKModes
when it is invoked inside MEDseq_fit
.
Update: as of version 1.7-1
of klaR, klaR::kmodes
now breaks assignment ties at random only when fast=TRUE
. It still breaks assignment ties when fast=FALSE
and all ties for modal values in the non-random manner described above. Thus, the old behaviour of klaR::kmodes
can be recovered by specifying random=FALSE
here, but random=TRUE
allows random tie-breaking for both types of ties in all situations.
Author(s)
Keefe Murphy - <keefe.murphy@mu.ie>
(adapted from klaR::kmodes
)
References
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, June 21-July 18, 1965 and December 27 1965-January 7, 1966, Statistical Laboratory of the University of California, Berkelely, CA, USA, pp. 281-297. University of California Press.
See Also
MEDseq_control
, MEDseq_fit
, dist_freqwH
, wcAggregateCases
, seqformat
Examples
suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)
# Create a state sequence object without the first two (summer) time points
states <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels <- c("Employment", "Further Education", "Higher Education",
"Joblessness", "School", "Training")
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86],
states=states, labels=labels,
weights=agg$aggWeights)
# Run k-modes without the weights
resX <- wKModes(mvad.seq, 2)
# Run k-modes with the weights
resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights)
# Examine the modal sequences of both solutions
seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE)
seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE)
# Using tot.withindiff to choose the number of clusters
TWdiffs <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)
plot(TWdiffs, type="b", xlab="K")
# Use multiple random starts to account for inherent randomness
TWDiff <- sapply(1:5, function(k) min(replicate(10,
wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)))
plot(TWDiff, type="b", xlab="K")