varclust {varclust} | R Documentation |
Variable Clustering with Multiple Latent Components Clustering algorithm
Description
Package varclust clusters variables according to a probabilistic model which assumes that each cluster lies in a low-dimensional subspace. The segmentation of variables, the number of clusters and the cluster dimensions are selected using an appropriate implementation of the Bayesian Information Criterion.
Details
The best candidate models are identified by a specific implementation of the K-means algorithm, in which cluster centers are represented by some number of orthogonal factors (principal components of the variables within a cluster) and the similarity between a given variable and a cluster center depends on the residuals from a linear model fit. Based on the Bayesian Information Criterion (BIC), the residual sums of squares are appropriately scaled, which avoids excessive attraction by clusters with larger dimensions. To reduce the chance of obtaining a local minimum of the modified BIC (mBIC) instead of the global one, the K-means algorithm is run a large number of times for every fixed number of clusters in a given range, each time with a different random initialization of the cluster centers.
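The iteration described above can be illustrated with a minimal sketch in base R. It is not the package's implementation: the function name subspace_kmeans and its arguments are hypothetical, every cluster is assumed to have the same fixed dimension, and the BIC-based scaling of the residuals and the repeated random restarts are left out.

```r
# Simplified sketch of K-means on variables with subspace cluster centers.
# Assumption: all clusters share one fixed dimension `dim`, so the residual
# sums of squares are compared directly, without any BIC-based scaling.
subspace_kmeans <- function(X, k, dim = 1, max.iter = 20) {
  X <- scale(X)                                   # center and scale the variables
  p <- ncol(X)
  segmentation <- sample(rep_len(seq_len(k), p))  # random initial partition
  for (iter in seq_len(max.iter)) {
    rss <- matrix(Inf, nrow = p, ncol = k)
    for (j in seq_len(k)) {
      members <- which(segmentation == j)
      if (length(members) == 0) next              # empty cluster attracts nothing
      # Cluster center: leading left singular vectors, i.e. orthonormal factors
      d <- min(dim, length(members))
      U <- svd(X[, members, drop = FALSE], nu = d, nv = 0)$u
      # Residuals of every variable after regressing it on the factors
      R <- X - U %*% crossprod(U, X)
      rss[, j] <- colSums(R^2)
    }
    # Assign each variable to the cluster with the smallest residual sum of squares
    new.segmentation <- apply(rss, 1, which.min)
    if (all(new.segmentation == segmentation)) break
    segmentation <- new.segmentation
  }
  segmentation
}

# Toy data: two groups of 10 variables, each driven by its own factor
set.seed(1)
n <- 100
f1 <- rnorm(n)
f2 <- rnorm(n)
X <- cbind(outer(f1, runif(10, 1, 2)), outer(f2, runif(10, 1, 2))) +
  matrix(rnorm(n * 20, sd = 0.5), n, 20)
seg <- subspace_kmeans(X, k = 2)
table(seg, truth = rep(1:2, each = 10))
```

A single run like this may stop in a local optimum, which is why the package repeats the procedure from many random initializations.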
The main function of package varclust is mlcc.bic, which clusters the variables in a data set with an unknown number of clusters. The variable partition is computed with a K-means based algorithm. The number of clusters and their dimensions are estimated using mBIC and PESEL, respectively. If the number of clusters is known, one can use the function mlcc.reps, which takes the number of clusters as a parameter. For mlcc.reps one can also specify an initial segmentation for the K-means algorithm. This can be useful if the user has some a priori knowledge about the clustering.
We also provide two functions to simulate data sets with the described structure. The function data.simulation generates the data so that the subspaces are independent, and data.simulation.factors generates the data where some factors are shared between the subspaces.
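To make the simulated structure concrete, the following base R sketch generates variables that lie near K independent low-dimensional subspaces. It is only an illustration and not the code behind data.simulation: the function name simulate_subspaces, the fixed common dimension and the definition of SNR as signal variance over noise variance are assumptions made here.

```r
# Hypothetical sketch of data with independent subspaces (base R only).
# Assumptions: every subspace has the same dimension `dim`, and SNR is the
# ratio of signal variance to noise variance for each variable.
simulate_subspaces <- function(n, K, numb.vars, dim, SNR = 1) {
  blocks <- lapply(seq_len(K), function(k) {
    factors <- matrix(rnorm(n * dim), n, dim)               # factors of subspace k
    coefs <- matrix(rnorm(dim * numb.vars), dim, numb.vars)  # random loadings
    signal <- scale(factors %*% coefs)                       # unit-variance signal
    noise <- matrix(rnorm(n * numb.vars, sd = 1 / sqrt(SNR)), n, numb.vars)
    signal + noise
  })
  list(X = do.call(cbind, blocks),                 # n x (K * numb.vars) data matrix
       s = rep(seq_len(K), each = numb.vars))      # true segmentation of variables
}

set.seed(42)
sim <- simulate_subspaces(n = 50, K = 3, numb.vars = 10, dim = 2)
dim(sim$X)
table(sim$s)
```

Shared factors could be mimicked in the same sketch by drawing the factors of several subspaces from one common pool instead of generating them separately.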
We also provide two measures of clustering quality. The function misclassification computes the misclassification rate between two partitions. This performance measure is extensively used in image segmentation. The other measure is implemented as the function integration.
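As an illustration of what a misclassification rate between two partitions measures, the base R sketch below matches cluster labels greedily by their overlap and reports the share of unmatched elements. The name misclassification_rate is hypothetical, and the greedy matching is an assumption of this sketch; the package function misclassification may be defined differently.

```r
# Hypothetical sketch of a label-invariant misclassification rate.
# Labels of the two partitions are paired greedily by largest overlap.
misclassification_rate <- function(truth, estimate) {
  stopifnot(length(truth) == length(estimate))
  tab <- unclass(table(truth, estimate))   # contingency table as a plain matrix
  matched <- 0
  while (length(tab) > 0 && max(tab) > 0) {
    # Pick the pair of labels with the largest remaining overlap
    pos <- which(tab == max(tab), arr.ind = TRUE)[1, ]
    matched <- matched + tab[pos[1], pos[2]]
    # Remove both labels so that each is matched at most once
    tab <- tab[-pos[1], -pos[2], drop = FALSE]
  }
  1 - matched / length(truth)
}

# Relabeling the clusters does not count as an error
misclassification_rate(c(1, 1, 2, 2), c(2, 2, 1, 1))   # 0
# One of six elements is placed in the wrong cluster
misclassification_rate(c(1, 1, 1, 2, 2, 2), c(1, 1, 2, 2, 2, 2))
```

Because cluster labels are arbitrary, any such measure has to match the labels of the two partitions before counting disagreements.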
Version: 0.9.4
Author(s)
Piotr Sobczyk, Stanislaw Wilczynski, Julie Josse, Malgorzata Bogdan
Maintainer: Piotr Sobczyk pj.sobczyk@gmail.com
Examples
library(varclust)
# Simulate data, then cluster with an unknown (mlcc.bic) and a known (mlcc.reps) number of clusters
sim.data <- data.simulation(n = 50, SNR = 1, K = 3, numb.vars = 50, max.dim = 3)
mlcc.bic(sim.data$X, numb.clusters = 1:5, numb.runs = 20, numb.cores = 1, verbose = TRUE)
mlcc.reps(sim.data$X, numb.clusters = 3, numb.runs = 20, numb.cores = 1)