R: Weighted K-Modes Clustering with Tie-Breaking

wKModes {MEDseq}

R Documentation

Weighted K-Modes Clustering with Tie-Breaking

Description

Perform k-modes clustering on categorical data with observation-specific sampling weights and tie-breaking adjustments.

Usage

wKModes(data,
        modes,
        weights = NULL,
        iter.max = .Machine$integer.max,
        freq.weighted = FALSE,
        fast = TRUE,
        random = TRUE,
        ...)

Arguments

`data`	A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.
`modes`	Either the number of modes or a set of initial (distinct) cluster modes (where each mode is a row and `modes` has the same number of columns as `data`). If a number, a random set of (distinct) rows in `data` is chosen as the initial modes. Note, this randomness is always present, and is not governed by `random` below.
`weights`	Optional numeric vector containing non-negative observation-specific case weights.
`iter.max`	The maximum number of iterations allowed. Defaults to `.Machine$integer.max`. The algorithm terminates when `iter.max` is reached or when the partition ceases to change between iterations.
`freq.weighted`	A logical indicating whether the usual simple-matching (Hamming) distance between objects is used, or a frequency weighted version of this distance. Defaults to `FALSE`; when `TRUE`, the frequency weights are computed within the algorithm and are not user-specified. Distinct from the observation-level `weights` above, the frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of `data`. For convenience, the function `dist_freqwH` is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
`fast`	A logical indicating whether a fast version of the algorithm should be applied. Defaults to `TRUE`.
`random`	A logical indicating whether ties for the modal values &/or assignments are broken at random. Defaults to `TRUE`: the implied default had been `FALSE` prior to version `1.3.2` of this package, as per `klaR::kmodes` prior to version `1.7-1` (see Note). Note that when `modes` is specified as the number of modes, the algorithm is always randomly initialised, regardless of the specification of `random`. Regarding the modes, ties are broken at random when `TRUE` and the first candidate state is always chosen for the mode when `FALSE`. Regarding assignments, tie-breaking is always first biased in favour of the observation's most recent cluster: regarding ties thereafter, these are broken at random when `TRUE` or the first other candidate cluster is always chosen when `FALSE`.
`...`	Catches unused arguments.

Details

The k-modes algorithm (Huang, 1998) is an extension of the k-means algorithm by MacQueen (1967).

The data given by data is clustered by the k-modes method (Huang, 1998) which aims to partition the objects into k groups such that the distance from objects to the assigned cluster modes is minimised.

By default, the simple-matching (Hamming) distance is used to determine the dissimilarity of two objects. It is computed by counting the number of mismatches in all variables. Alternatively, this distance can be weighted by the frequencies of the categories in data, using the freq.weighted argument (see Huang, 1998, for details). For convenience, the function dist_freqwH is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.

If an initial matrix of modes is supplied, it is possible that no object will be closest to one or more modes. In this case, fewer clusters than the number of supplied modes will be returned and a warning will be printed.

If called using fast = TRUE, the reassignment of the data to clusters is done for the entire data set before recomputation of the modes is done. For computational reasons, this option should be chosen for all but the most moderate of data sizes.

Value

An object of class "wKModes" which is a list with the following components:

cluster: A vector of integers indicating the cluster to which each object is allocated.
size: The number of objects in each cluster.
modes: A matrix of cluster modes.
withindiff: The within-cluster (weighted) simple-matching distance for each cluster.
tot.withindiff: The total within-cluster (weighted) distance over all clusters. tot.withindiff can be used to guide the choice of the number of clusters, but beware of inherent randomness in the algorithm, which is liable to yield a jagged elbow plot (see examples).
iterations: The number of iterations the algorithm reached.
weighted: A logical indicating whether observation-level weights were used or not throughout the algorithm.
freq.weighted: A logical indicating whether feature-level freq.weights were used or not in the computation of the distances. For convenience, the function dist_freqwH is provided for calculating the corresponding pairwise dissimilarity matrix for subsequent use.
random: A logical indicating whether ties were broken at random or not throughout the algorithm.

Note

This code is adapted from the kmodes function in the klaR package. Specifically, modifications were made to allow for random tie-breaking for the modes and assignments (see random above) and the incorporation of observation-specific sampling weights, with a view to using this function as a means to initialise the allocations for MEDseq models (see the MEDseq_control argument init.z and the related options "kmodes" and "kmodes2").

Notably, the wKModes function, when invoked inside MEDseq_fit, is used regardless of whether the weights are true sampling weights, or the weights are merely aggregation weights, or there are no weights at all. Furthermore, the MEDseq_control argument random is also passed to wKModes when it is invoked inside MEDseq_fit.

Update: as of version 1.7-1 of klaR, klaR::kmodes now breaks assignment ties at random only when fast=TRUE. It still breaks assignment ties when fast=FALSE and all ties for modal values in the non-random manner described above. Thus, the old behaviour of klaR::kmodes can be recovered by specifying random=FALSE here, but random=TRUE allows random tie-breaking for both types of ties in all situations.

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie> (adapted from klaR::kmodes)

References

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.

MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In L. M. L. Cam and J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1, June 21-July 18, 1965 and December 27 1965-January 7, 1966, Statistical Laboratory of the University of California, Berkelely, CA, USA, pp. 281-297. University of California Press.

Examples

suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg      <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)

# Create a state sequence object without the first two (summer) time points
states   <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels   <- c("Employment", "Further Education", "Higher Education", 
              "Joblessness", "School", "Training")
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], 
                   states=states, labels=labels, 
                   weights=agg$aggWeights)

# Run k-modes without the weights
resX     <- wKModes(mvad.seq, 2)

# Run k-modes with the weights
resW     <- wKModes(mvad.seq, 2, weights=agg$aggWeights)

# Examine the modal sequences of both solutions
seqformat(seqdef(resX$modes), from="STS", to="SPS", compress=TRUE)
seqformat(seqdef(resW$modes), from="STS", to="SPS", compress=TRUE)

# Using tot.withindiff to choose the number of clusters

TWdiffs   <- sapply(1:5, function(k) wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)
plot(TWdiffs, type="b", xlab="K")

# Use multiple random starts to account for inherent randomness
TWDiff    <- sapply(1:5, function(k) min(replicate(10, 
                    wKModes(mvad.seq, k, weights=agg$aggWeights)$tot.withindiff)))
plot(TWDiff, type="b", xlab="K")

[Package MEDseq version 1.4.1 Index]