R: Clustering with Disjoint Principal Components Analysis

dpcakm {drclust}

R Documentation

Clustering with Disjoint Principal Components Analysis

Description

Simultaneously performs k-means partitioning of the units and disjoint PCA on the variables, so that each principal component is computed from a different subset of variables. The result is a simplified, easier-to-interpret loading matrix A, together with the principal components and the clustering. The reduced subspace is identified by the centroids.
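To make the idea of a "disjoint" loading matrix concrete, the following base R sketch builds one by hand. It is not the algorithm used by dpcakm: the variable-to-component assignment `groups` is fixed by assumption here, whereas the function estimates it jointly with the clustering. All object names are illustrative.

```r
# Hypothetical illustration (base R only): a disjoint loading matrix for
# J = 4 variables and Q = 2 components, built by hand rather than estimated.
X <- scale(as.matrix(datasets::iris[, -5]))   # z-scored numeric iris variables
groups <- c(1, 1, 2, 2)                       # assumed variable-to-component assignment

A <- matrix(0, nrow = ncol(X), ncol = 2)
for (q in 1:2) {
  idx <- which(groups == q)
  # First principal axis of the variables in group q only
  A[idx, q] <- eigen(cov(X[, idx, drop = FALSE]))$vectors[, 1]
}

scores <- X %*% A          # component scores, one column per component
rowSums(A != 0)            # each variable loads on exactly one component
round(crossprod(A), 10)    # columns are orthonormal (identity matrix)
```

Because every row of A has a single non-zero entry, each component can be read as a weighted combination of its own subset of variables only, which is what makes the loadings easier to interpret.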

Usage

dpcakm(X, K, Q, Rndstart, verbose, maxiter, tol, constr, print, prep)

Arguments

`X`	Units x variables numeric data matrix.
`K`	Number of clusters for the units.
`Q`	Number of principal components.
`Rndstart`	Number of runs to be performed (default is 20).
`verbose`	Outputs basic summary statistics for each run (1 = enabled; 0 = disabled, default option).
`maxiter`	Maximum number of iterations allowed if convergence has not yet been reached (default is 100).
`tol`	Tolerance threshold: the maximum difference between the objective function values of two consecutive iterations below which convergence is assumed (default is 1e-6).
`constr`	Vector of length J (the number of variables) pre-specifying the cluster to which some of the variables must be assigned. Each element takes an integer value from 1 to Q (the number of variable clusters, i.e. principal components; see Examples), or 0 if no constraint is imposed on that variable (i.e., it is assigned by the unconstrained algorithm).
`print`	Prints summary statistics of the results (1 = enabled; 0 = disabled, default option).
`prep`	Pre-processing of the data. 1 performs the z-score transform (default choice); 2 performs the min-max transform; 0 leaves the data un-pre-processed.
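The two pre-processing options of `prep` can be sketched in base R as follows. This only shows what a z-score and a min-max transform conventionally compute; whether dpcakm standardizes with the sample or the population standard deviation is not stated here, so the equivalence is an assumption.

```r
X <- as.matrix(datasets::iris[, -5])

# prep = 1 (assumed equivalent): z-score, each column to mean 0 and sd 1
Xz <- scale(X, center = TRUE, scale = TRUE)

# prep = 2 (assumed equivalent): min-max, each column rescaled to [0, 1]
Xmm <- apply(X, 2, function(x) (x - min(x)) / (max(x) - min(x)))

round(colMeans(Xz), 10)    # column means are zero
apply(Xz, 2, sd)           # column standard deviations are one
apply(Xmm, 2, range)       # every column spans 0 to 1
```

Data transformed this way beforehand could presumably be passed with `prep = 0`, which leaves the input untouched.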

Value

Returns a list of estimates and some descriptive quantities of the final results.

`V`	Variables x factors membership matrix (binary and row-stochastic). Each row is a dummy vector indicating the cluster to which the corresponding variable has been assigned.
`U`	Units x clusters membership matrix (binary and row-stochastic). Each row is a dummy vector indicating the cluster to which the corresponding unit has been assigned.
`A`	Variables x components loading matrix.
`centers`	K x Q matrix of centers containing the row means expressed in the reduced space of Q principal components.
`totss`	The total sum of squares (scalar).
`withinss`	Vector of within-cluster sum of squares, one component per cluster.
`betweenss`	Amount of deviance captured by the model (scalar).
`K-size`	Number of units assigned to each row-cluster (vector).
`Q-size`	Number of variables assigned to each column-cluster (vector).
`pseudoF`	Calinski-Harabasz index of the resulting partition (scalar).
`loop`	The index of the (best) run from which the results have been chosen.
`it`	The number of iterations performed during the (best) run.
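The structure of the membership matrix and the relationship between the sums of squares and the Calinski-Harabasz index can be illustrated with a short base R sketch. It uses `stats::kmeans` on the full data as a stand-in for the clustering step, so the numbers will not match those returned by dpcakm, which works in the reduced space; the list `res` is purely hypothetical.

```r
X <- scale(as.matrix(datasets::iris[, -5]))
n <- nrow(X)
K <- 3

set.seed(1)
km <- kmeans(X, centers = K, nstart = 5)   # stand-in for the clustering step

# Binary, row-stochastic membership matrix: exactly one 1 per row
U <- diag(K)[km$cluster, ]
rowSums(U)     # all equal to 1
colSums(U)     # cluster sizes, the analogue of `K-size`

# Calinski-Harabasz (pseudo-F) index from the sums of squares
pseudoF <- (km$betweenss / (K - 1)) / (sum(km$withinss) / (n - K))

# Names such as `K-size` are not syntactic in R, so they need
# backticks or [[ ]] when extracted from a list
res <- list(`K-size` = colSums(U))
res[["K-size"]]
res$`K-size`
```

The same access pattern would apply to the `K-size` and `Q-size` elements of the returned list, assuming they are stored under exactly those names.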

Author(s)

Ionel Prunila, Maurizio Vichi

References

Vichi M., Saporta G. (2009) "Clustering and disjoint principal component analysis" <doi:10.1016/j.csda.2008.05.028>

Examples

# Iris data 
# Loading the numeric variables of iris data
iris <- as.matrix(iris[,-5]) 

# No constraint on variables
out <- dpcakm(iris, K = 3, Q = 2, Rndstart = 5)

# Constraint: the first two variables must contribute to the same factor.
outc <- dpcakm(iris, K = 3, Q = 2, Rndstart = 5, constr = c(1, 1, 0, 0))
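As a hedged follow-up to the constrained example, the sketch below validates a constraint vector before fitting and then inspects the rows of V for the constrained variables. The checks are illustrative and not part of the package; the call to dpcakm only runs if drclust is installed, and it assumes the returned list exposes `V` as described in the Value section.

```r
X <- as.matrix(datasets::iris[, -5])
Q <- 2

# 0 = unconstrained; q in 1..Q = variable forced into component q
constr <- c(1, 1, 0, 0)

# Sanity checks on the constraint vector before fitting (illustrative only)
stopifnot(length(constr) == ncol(X),
          all(constr %in% 0:Q))

if (requireNamespace("drclust", quietly = TRUE)) {
  outc <- drclust::dpcakm(X, K = 3, Q = Q, Rndstart = 5, constr = constr)
  # Rows of V for the two constrained variables should share the same column
  print(outc$V[constr == 1, , drop = FALSE])
}
```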

[Package drclust version 0.1 Index]