dist_freqwH {MEDseq}    R Documentation

Pairwise frequency-weighted Hamming distance matrix for categorical data

Description

Computes the matrix of pairwise distances using a frequency-weighted variant of the Hamming distance often used in k-modes clustering.

Usage

dist_freqwH(data,
            full.matrix = TRUE)

Arguments

data

A matrix or data frame of categorical data. Objects have to be in rows, variables in columns.

full.matrix

Logical. If TRUE (the default), the full pairwise distance matrix is returned; otherwise, an object of class dist is returned, i.e. a vector containing only the values from the lower triangle of the (symmetric) distance matrix. Objects of class dist are smaller and can be passed directly as arguments to most clustering functions.

Details

As per wKModes, the frequency weights are computed within the function and are not user-specified. These frequency weights are assigned on a per-feature basis and derived from the categories represented in each column of data.
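To make the per-feature weighting concrete, the following is a minimal sketch in plain R of a frequency-weighted Hamming distance in the style of Huang (1998), where a mismatch in a given column contributes (n_a + n_b) / (n_a * n_b), with n_a and n_b the column-wise frequencies of the two mismatching categories. This formula is an assumption about the weighting, and the helper name freqw_hamming_sketch is hypothetical and not part of MEDseq; the package's own computation may differ in detail (e.g. in its handling of missing values, which this sketch ignores).

```r
# Hypothetical helper, NOT part of MEDseq: illustrates one plausible
# frequency-weighted Hamming distance, assuming the Huang (1998) weighting.
freqw_hamming_sketch <- function(data) {
  data <- as.matrix(data)                 # objects in rows, variables in columns
  n    <- nrow(data)
  D    <- matrix(0, n, n)
  for (j in seq_len(ncol(data))) {
    x        <- as.character(data[, j])
    counts   <- table(x)                  # category frequencies in column j
    nx       <- as.numeric(counts[x])     # frequency of each object's category
    mismatch <- outer(x, x, "!=")         # TRUE where the two categories differ
    w        <- outer(nx, nx, function(a, b) (a + b) / (a * b))
    D        <- D + mismatch * w          # only mismatches contribute
  }
  D
}

# Toy data: two categorical variables observed on four objects
toy   <- data.frame(v1 = c("a", "a", "b", "b"),
                    v2 = c("x", "y", "y", "y"))
D_toy <- freqw_hamming_sketch(toy)
```

Under this assumed weighting, a mismatch involving a rare category (such as "x" in v2 above) is penalised more heavily than a mismatch between two common categories.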

Value

The whole matrix of pairwise distances if full.matrix=TRUE, otherwise the corresponding dist object.
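The relationship between the two return types can be illustrated with base R alone. The sketch below does not call dist_freqwH; it uses a small hand-written symmetric matrix (the object names m, d, and hc are illustrative only) to show how a full matrix and a dist object convert into one another via as.dist and as.matrix, and that a dist object can be handed straight to a clustering function such as hclust.

```r
# A small symmetric matrix standing in for a full pairwise distance matrix
m <- matrix(c(0, 1, 2,
              1, 0, 3,
              2, 3, 0), nrow = 3, byrow = TRUE)

# Compact representation: only the n * (n - 1) / 2 off-diagonal values are stored
d <- as.dist(m)
length(d)

# Convert back to a full matrix when one is required
m2 <- as.matrix(d)

# dist objects can be passed directly to e.g. hierarchical clustering
hc <- hclust(d, method = "average")
```

Which form is more convenient depends on the downstream function: some expect a dist object, while others accept or require a full matrix.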

Author(s)

Keefe Murphy - <keefe.murphy@mu.ie>

References

Huang, Z. (1998). Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3): 283-304.

See Also

wKModes, wcAggregateCases, wcSilhouetteObs

Examples

library(MEDseq)
suppressMessages(library(WeightedCluster))
set.seed(99)
# Load the MVAD data & aggregate the state sequences
data(mvad)
agg      <- wcAggregateCases(mvad[,17:86], weights=mvad$weight)

# Create a state sequence object without the first two (summer) time points
states   <- c("EM", "FE", "HE", "JL", "SC", "TR")
labels   <- c("Employment", "Further Education", "Higher Education", 
              "Joblessness", "School", "Training")
weights  <- agg$aggWeights
mvad.seq <- seqdef(mvad[agg$aggIndex, 17:86], 
                   states=states, labels=labels, weights=agg$aggWeights)

# Run k-modes with weights
resW     <- wKModes(mvad.seq, 2, weights=agg$aggWeights)

# Run k-modes with additional frequency weights
resF     <- wKModes(mvad.seq, 2, weights=agg$aggWeights, freq.weighted=TRUE)

# Examine the average silhouette widths of both weighted solutions:
# resW under the Hamming distance, resF under the frequency-weighted variant
# (the commented-out line would instead evaluate resF under the plain Hamming distance)
weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resW$cluster, weights), weights)
# weighted.mean(wcSilhouetteObs(seqdist(mvad.seq, method="HAM"), resF$cluster, weights), weights)
weighted.mean(wcSilhouetteObs(dist_freqwH(mvad.seq), resF$cluster, weights), weights)

[Package MEDseq version 1.4.1 Index]