R: Pooled variable scaling for cluster analysis

PVS {clusterHD}

R Documentation

Pooled variable scaling for cluster analysis

Description

The function computes a scale for each variable in the data. The result can then be used to standardize a dataset before applying a clustering algorithm (such as k-means). The scale estimates are pooled scale estimators, obtained by clustering each variable in the data individually. The method is proposed in Raymaekers and Zamar (2020) <doi:10.1093/bioinformatics/btaa243>.
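The idea of a pooled scale for a single variable can be sketched in base R as follows. This is an illustration only, not the implementation used by `PVS`: the helper `pooled_sd` is hypothetical, the number of clusters `k` is fixed by hand rather than chosen by the gap or jump statistic, and the `n - k` denominator is an assumption.

```r
# Sketch: pooled standard deviation of one variable for a fixed k.
# pooled_sd() is a hypothetical helper, not part of clusterHD.
pooled_sd <- function(x, k) {
  if (k == 1) return(sd(x))
  fit <- kmeans(x, centers = k, nstart = 20)
  # Total within-cluster sum of squares divided by n - k
  # (the denominator is an assumption made for this sketch).
  sqrt(fit$tot.withinss / (length(x) - k))
}

set.seed(1)
x <- iris$Petal.Length
overall   <- sd(x)            # ordinary standard deviation
pooled_k2 <- pooled_sd(x, 2)  # pooled over two clusters
c(overall = overall, pooled_k2 = pooled_k2)
```

For a variable with clear cluster structure, the pooled estimate is smaller than the overall standard deviation, so dividing by it gives that variable more weight in a subsequent clustering.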

Usage

PVS(X, kmax = 3, dist = "euclidean",
    method = "gap", B = 1000,
    gapMethod = "firstSEmax",
    minSize = 0.05, rDist = runif,
    SE.factor = 1, refDist = NULL)

Arguments

`X`	an `n` by `p` data matrix.
`kmax`	maximum number of clusters in one variable. Default is `3`.
`dist`	`"euclidean"` for pooled standard deviation and `"manhattan"` for pooled mean absolute deviation. Default is `"euclidean"`.
`method`	either `"gap"` or `"jump"` to determine the number of clusters. Default is `"gap"`.
`B`	number of bootstrap samples for the reference distribution of the gap statistic. Default is `1000`.
`gapMethod`	method to define number of clusters in the gap statistic. See `cluster::maxSE` for more info. Defaults to `"firstSEmax"`.
`minSize`	minimum cluster size as a proportion of the total number of observations. Defaults to `0.05`.
`rDist`	Optional. Reference distribution (as a function) for the gap statistic. Defaults to `runif`, the uniform distribution.
`SE.factor`	factor for determining the number of clusters when using the gap statistic. See `cluster::maxSE` for more details. Defaults to `1`.
`refDist`	Optional. A `k` by `2` matrix with the mean and standard error of the reference distribution of the gap statistic in its columns. Can be used to avoid bootstrapping when repeatedly applying the function to data of the same size.
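The roles of `gapMethod` and `SE.factor` can be illustrated by calling `cluster::maxSE` directly. The gap values and standard errors below are made up for illustration and are not output of `PVS`; how `PVS` passes its values to `maxSE` internally is not shown here.

```r
library(cluster)

# Hypothetical gap values and standard errors for k = 1, ..., 4 (made up).
gap <- c(0.10, 0.50, 0.52, 0.40)
se  <- c(0.05, 0.05, 0.05, 0.05)

# Number of clusters selected with the default rule and SE.factor.
k_default <- maxSE(gap, se, method = "firstSEmax", SE.factor = 1)

# SE.factor = 0 removes the standard-error tolerance from the rule.
k_strict <- maxSE(gap, se, method = "firstSEmax", SE.factor = 0)

c(default = k_default, strict = k_strict)
```

With `"firstSEmax"`, a larger `SE.factor` widens the tolerance around the first local maximum, so the selected number of clusters can only stay the same or decrease.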

Value

A vector of length p containing the estimated scales for the variables.
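A minimal sketch of how such a scale vector might be applied, using base R only. The vector `scales` here is a stand-in computed with `sd`; in practice it would be the output of `PVS(X)`.

```r
X <- as.matrix(iris[, -5])

# Stand-in for the output of PVS(X): one positive scale per column.
scales <- apply(X, 2, sd)

# Divide each column by its scale.
X_scaled <- sweep(X, 2, scales, "/")

# Equivalent call using scale() without centering.
X_scaled2 <- scale(X, center = FALSE, scale = scales)
```

Centering is optional here, since k-means is unaffected by shifting the variables.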

Author(s)

Jakob Raymaekers

References

Raymaekers, J., Zamar, R.H. (2020). Pooled variable scaling for cluster analysis. Bioinformatics, 36(12), 3849-3855. doi: 10.1093/bioinformatics/btaa243

Examples



X <- iris[, -5]
y <- unclass(iris[, 5])

# Compute scales using different scale estimators.
# The pooled standard deviation is considerably smaller for variables 3 and 4:
sds     <- apply(X, 2, sd); round(sds, 2)
ranges  <- apply(X, 2, function(v) diff(range(v))); round(ranges, 2)
psds    <- PVS(X); round(psds, 2)

# Now cluster using k-means after scaling the data

nbclus <- 3
kmeans.std <- kmeans(X, nbclus, nstart = 100) # no scaling
kmeans.sd  <- kmeans(scale(X), nbclus, nstart = 100)
kmeans.rg  <- kmeans(scale(X, scale = ranges), nbclus, nstart = 100)
kmeans.psd <- kmeans(scale(X, scale = psds), nbclus, nstart = 100)

# Calculate the Adjusted Rand Index for each of the clustering outcomes
round(mclust::adjustedRandIndex(y, kmeans.std$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.sd$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.rg$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.psd$cluster), 2)

[Package clusterHD version 1.0.2 Index]