dist_freqwH {MEDseq} | R Documentation |
Pairwise frequency-weighted Hamming distance matrix for categorical data
Description
Computes the matrix of pairwise distances using a frequency-weighted variant of the Hamming distance often used in k-modes clustering.
Usage
dist_freqwH(data,
            full.matrix = TRUE)
Arguments
data |
A matrix or data frame of categorical data. Objects have to be in rows, variables in columns. |
full.matrix |
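Logical. If TRUE (the default), the full matrix of pairwise distances is returned; otherwise, the corresponding dist object is returned. |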
Details
As per wKModes, the frequency weights are computed within the function and are not user-specified. These frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of data.
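The per-feature weighting described above can be illustrated with a minimal sketch. It assumes the weights follow the frequency-based dissimilarity of Huang (1998), in which a mismatch between categories a and b in a given column contributes (n_a + n_b) / (n_a * n_b), where n_a and n_b are the counts of those categories in that column. The helper freq_weighted_hamming is hypothetical and not part of MEDseq; the internals of dist_freqwH may differ.

```r
# Hypothetical helper (not part of MEDseq): a sketch of a frequency-weighted
# Hamming distance, ASSUMING the Huang (1998) frequency-based weighting.
freq_weighted_hamming <- function(data) {
  # Treat every column as a plain character vector of categories
  data <- as.data.frame(lapply(as.data.frame(data), as.character),
                        stringsAsFactors = FALSE)
  n <- nrow(data)
  D <- matrix(0, n, n)
  for (j in seq_along(data)) {
    x <- data[[j]]
    # Count of the category observed in each row, within this column only
    freq <- as.numeric(table(x)[x])
    # TRUE wherever two rows disagree in this column
    mismatch <- outer(x, x, "!=")
    # (n_a + n_b) / (n_a * n_b) simplifies to 1/n_a + 1/n_b
    w <- outer(1 / freq, 1 / freq, "+")
    D <- D + mismatch * w
  }
  D
}

# Toy data: three objects (rows) described by two categorical variables
toy <- data.frame(a = c("x", "x", "y"),
                  b = c("u", "v", "v"))
D <- freq_weighted_hamming(toy)
```

Under this assumed weighting, a mismatch involving rare categories costs more than one between common categories, whereas the plain Hamming distance counts every mismatch as 1.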
Value
The whole matrix of pairwise distances if full.matrix=TRUE, otherwise the corresponding dist object.
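The two return forms carry the same information. The following base R sketch uses an illustrative toy matrix (the values are made up) and assumes the returned dist object follows the usual stats conventions, so that as.dist and as.matrix convert between the two forms.

```r
# Toy symmetric distance matrix; the values are illustrative only
full <- matrix(c(0,   1.5, 3,
                 1.5, 0,   1.5,
                 3,   1.5, 0), nrow = 3, byrow = TRUE)

# Compact form: stores only the lower triangle, with class "dist"
compact <- as.dist(full)

# Back to the full symmetric matrix
restored <- as.matrix(compact)
```

The dist form is what functions such as hclust expect, while the full matrix is convenient for direct indexing of individual pairs.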
Author(s)
Keefe Murphy - <keefe.murphy@mu.ie>
References
Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.
See Also
wKModes, wcAggregateCases, wcSilhouetteObs
Examples
suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)
# Create a state sequence object without the first two (summer) time points
states <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels <- c("Employment", "Further Education", "Higher Education",
            "Joblessness", "School", "Training")
weights <- agg$aggWeights
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86],
                   states=states, labels=labels, weights=agg$aggWeights)
# Run k-modes with weights
resW <- wKModes(mvad.seq, 2, weights=agg$aggWeights)
# Run k-modes with additional frequency weights
resF <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE)
# Examine the average silhouette widths of both weighted solutions
weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resW$cluster, weights), weights)
# weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resF$cluster, weights), weights)
weighted.mean(wcSilhouetteObs(dist_freqwH(mvad.seq), resF$cluster, weights), weights)