PVS {clusterHD}  R Documentation

Pooled variable scaling for cluster analysis

Description

The function computes a scale for each variable in the data. The resulting scales can then be used to standardize a dataset before applying a clustering algorithm (such as k-means). The scale estimation is based on pooled scale estimators, which result from clustering the individual variables in the data. The method was proposed in Raymaekers and Zamar (2020) <doi:10.1093/bioinformatics/btaa243>.
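To make the idea of a pooled scale estimator concrete, the following is a minimal sketch in base R for a single variable. The helper pooled_sd is hypothetical and not part of clusterHD. It assumes the number of clusters k is given and uses one common definition of the pooled standard deviation (within-cluster sum of squares over n - k degrees of freedom). PVS itself selects the number of clusters per variable and enforces a minimum cluster size, so its estimates may differ.

```r
# Hypothetical helper, not part of clusterHD: pooled standard deviation of a
# single variable for a fixed number of clusters k.
pooled_sd <- function(x, k) {
  if (k == 1) return(sd(x))
  cl <- kmeans(x, centers = k, nstart = 20)
  # Pool the within-cluster sums of squares over n - k degrees of freedom.
  sqrt(cl$tot.withinss / (length(x) - k))
}

set.seed(1)
x <- iris$Petal.Length
overall <- sd(x)               # ignores any cluster structure
pooled  <- pooled_sd(x, k = 2) # measures spread within clusters only
c(overall = overall, pooled = pooled)
```

When a variable separates the clusters well, the pooled scale is much smaller than the overall standard deviation, so dividing by it gives that variable more weight in a subsequent clustering.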

Usage

PVS(X, kmax = 3, dist = "euclidean",
    method = "gap", B = 1000,
    gapMethod = "firstSEmax",
    minSize = 0.05, rDist = runif,
    SE.factor = 1, refDist = NULL)

Arguments

X

an n by p data matrix.

kmax

maximum number of clusters in one variable. Default is 3.

dist

"euclidean" for pooled standard deviation and "manhattan" for pooled mean absolute deviation. Default is "euclidean".

method

either "gap" or "jump" to determine the number of clusters. Default is "gap".

B

number of bootstrap samples for the reference distribution of the gap statistic. Default is 1000.

gapMethod

method to define number of clusters in the gap statistic. See cluster::maxSE for more info. Defaults to "firstSEmax".

minSize

minimum cluster size as a proportion of the total number of observations. Defaults to 0.05 (5% of the observations).

rDist

Optional. Reference distribution (as a function) for the gap statistic. Defaults to runif, the uniform distribution.

SE.factor

factor for determining the number of clusters when using the gap statistic. See cluster::maxSE for more details. Defaults to 1.

refDist

Optional. A k by 2 matrix with the mean and standard error of the reference distribution of the gap statistic in its columns. Can be used to avoid bootstrapping when repeatedly applying the function to data of the same size.
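The arguments gapMethod and SE.factor refer to cluster::maxSE. As a rough illustration of what they control, the sketch below calls cluster::maxSE directly on made-up gap values and standard errors (the numbers are invented for this example and do not come from PVS).

```r
library(cluster)

# Invented gap statistics and standard errors for k = 1, ..., 4 clusters.
gap    <- c(0.10, 0.50, 0.55, 0.40)
se_gap <- c(0.05, 0.10, 0.10, 0.05)

# "firstmax": location of the first local maximum of the gap curve.
k_first <- maxSE(gap, se_gap, method = "firstmax")

# "firstSEmax": smallest k whose gap is within SE.factor standard errors
# of the first local maximum.
k_se <- maxSE(gap, se_gap, method = "firstSEmax", SE.factor = 1)

c(firstmax = k_first, firstSEmax = k_se)
```

A larger SE.factor makes the rule more conservative, so it tends to select fewer clusters.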

Value

A vector of length p containing the estimated scales for the variables.
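A short sketch of how such a vector of scales can be applied with base R. The vector s below is only a stand-in (ordinary standard deviations) so that the code runs without clusterHD. In practice it would be the result of PVS(X).

```r
X <- as.matrix(iris[, -5])

# Stand-in for the length-p vector returned by PVS(X).
s <- apply(X, 2, sd)

# scale() centers each column and divides it by the matching entry of s.
Z <- scale(X, center = TRUE, scale = s)

# The same operation written with sweep().
Z2 <- sweep(sweep(X, 2, colMeans(X), "-"), 2, s, "/")

all.equal(Z, Z2, check.attributes = FALSE)
```

Centering does not affect Euclidean distances between observations, so for k-means only the division by the scales changes the result.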

Author(s)

Jakob Raymaekers

References

Raymaekers, J., Zamar, R.H. (2020). Pooled variable scaling for cluster analysis. Bioinformatics, 36(12), 3849-3855. doi: 10.1093/bioinformatics/btaa243

Examples



X <- iris[, -5]
y <- unclass(iris[, 5])

# Compute scales using different scale estimators.
# The pooled standard deviation is considerably smaller for variables 3 and 4:
sds     <- apply(X, 2, sd); round(sds, 2)
ranges  <- apply(X, 2, function(v) diff(range(v))); round(ranges, 2)
psds    <- PVS(X); round(psds, 2)

# Now cluster using k-means after scaling the data

nbclus <- 3
kmeans.std <- kmeans(X, nbclus, nstart = 100) # no scaling
kmeans.sd  <- kmeans(scale(X), nbclus, nstart = 100)
kmeans.rg  <- kmeans(scale(X, scale = ranges), nbclus, nstart = 100)
kmeans.psd <- kmeans(scale(X, scale = psds), nbclus, nstart = 100)

# Calculate the Adjusted Rand Index for each of the clustering outcomes
# (requires the mclust package)
round(mclust::adjustedRandIndex(y, kmeans.std$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.sd$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.rg$cluster), 2) 
round(mclust::adjustedRandIndex(y, kmeans.psd$cluster), 2)
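The Adjusted Rand Index used above can also be computed in base R when mclust is not installed. The helper ari below is a hypothetical sketch of the standard pair-counting formula, not a function from clusterHD or mclust.

```r
# Hypothetical helper: Adjusted Rand Index from the contingency table of two
# label vectors a and b of equal length.
ari <- function(a, b) {
  tab <- table(a, b)
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))          # pairs together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2)) # pairs together in a
  sum_cols  <- sum(choose(colSums(tab), 2)) # pairs together in b
  expected  <- sum_rows * sum_cols / choose(n, 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Identical partitions up to relabeling give an index of 1.
ari(c(1, 1, 2, 2), c(2, 2, 1, 1))
```

The index does not depend on how the clusters are labeled, which is why it can compare k-means output with the species labels directly.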



[Package clusterHD version 1.0.2 Index]