R: Pairwise frequency-Weighted Hamming distance matrix for...

dist_freqwH {MEDseq}

R Documentation

Pairwise frequency-Weighted Hamming distance matrix for categorical data

Description

Computes the matrix of pairwise distances using a frequency-weighted variant of the Hamming distance often used in k-modes clustering.

Usage

dist_freqwH(data,
            full.matrix = TRUE)

Arguments

`data`	A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.
`full.matrix`	Logical. If `TRUE` (the default), the full pairwise distance matrix is returned; otherwise an object of class `dist` is returned, i.e. a vector containing only the values from the lower triangle of the symmetric distance matrix. Objects of class `dist` are smaller and can be passed directly as arguments to most clustering functions.

Details

As in `wKModes`, the frequency weights are computed within the function and cannot be specified by the user. These frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of `data`.
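This page does not state the exact weighting formula. As a rough illustration only, the sketch below assumes the frequency-based dissimilarity of Huang (1998), in which a mismatch between objects x and y on feature j contributes (n_x + n_y) / (n_x * n_y), where n_x and n_y count the rows sharing the respective categories in that column. The function `freqw_hamming_sketch` and the `toy` data are hypothetical; this is not the MEDseq source, and it ignores observation weights and missing values.

```r
# Hypothetical helper, NOT the MEDseq implementation: a frequency-weighted
# Hamming distance assuming the per-feature weighting of Huang (1998).
freqw_hamming_sketch <- function(data) {
  data <- as.matrix(data)               # categories are compared as strings
  n    <- nrow(data)
  # counts[i, j] = number of rows sharing row i's category in column j
  counts <- matrix(vapply(seq_len(ncol(data)), function(j) {
    col <- as.character(data[, j])
    as.numeric(table(col)[col])
  }, numeric(n)), nrow = n)
  D <- matrix(0, n, n)
  for (i in seq_len(n - 1L)) {
    for (k in (i + 1L):n) {
      differ <- data[i, ] != data[k, ]  # features on which the pair mismatch
      w      <- (counts[i, ] + counts[k, ]) / (counts[i, ] * counts[k, ])
      D[i, k] <- D[k, i] <- sum(w[differ])
    }
  }
  D
}

# Made-up toy data: three objects, two categorical features
toy <- data.frame(colour = c("red", "red", "blue"),
                  size   = c("S", "L", "L"))
D <- freqw_hamming_sketch(toy)
D[1, 2]  # 1.5: the pair differs on size only, (1 + 2) / (1 * 2)
D[1, 3]  # 3.0: the pair differs on both features, 1.5 + 1.5
```

Under this assumed weighting, a mismatch involving a rare category counts for more than a mismatch between two common categories, which is what distinguishes it from the plain Hamming distance.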

Value

The full matrix of pairwise distances if `full.matrix=TRUE`, otherwise the corresponding `dist` object.
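The relationship between the two return types can be illustrated with base R alone. The sketch below uses `stats::dist` on made-up numeric data rather than `dist_freqwH` itself, so it only shows how a full matrix and a `dist` object relate in general.

```r
# Made-up numeric data, purely to illustrate the two storage formats
x <- matrix(c(0, 0,
              3, 4,
              6, 8), ncol = 2, byrow = TRUE)

d <- dist(x)        # compact object of class "dist"
m <- as.matrix(d)   # full symmetric 3 x 3 matrix with a zero diagonal

length(d)           # n * (n - 1) / 2 = 3 stored values
dim(m)              # 3 3

# Converting the full matrix back recovers the same stored values
d2 <- as.dist(m)
all.equal(as.vector(d2), as.vector(d))

# A "dist" object can be passed straight to, e.g., hierarchical clustering
hc <- hclust(d)
```

Functions that expect a `dist` object can usually be given a full matrix via `as.dist()`, while functions that need the full matrix can use `as.matrix()` on a `dist` object.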

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.

Examples

# dist_freqwH() and wKModes() are provided by MEDseq
library(MEDseq)
suppressMessages(require(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg      <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)

# Create a state sequence object without the first two (summer) time points
states   <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels   <- c("Employment", "Further Education", "Higher Education", 
              "Joblessness", "School", "Training")
weights  <- agg$aggWeights
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], 
                   states=states, labels=labels, weights=agg$aggWeights)

# Run k-modes with weights
resW     <- wKModes(mvad.seq, 2, weights=agg$aggWeights)

# Run k-modes with additional frequency weights
resF     <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE)

# Examine the average silhouette widths of both weighted solutions
weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resW$cluster, weights), weights)
# weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resF$cluster, weights), weights)
weighted.mean(wcSilhouetteObs(dist_freqwH(mvad.seq), resF$cluster, weights), weights)

[Package MEDseq version 1.4.1 Index]