adproclus_low_dim {adproclus}    R Documentation

Low dimensional ADPROCLUS

Description

Perform low dimensional additive profile clustering (ADPROCLUS) on object-by-variable data. Use case: the data to be clustered consist of a large set of variables, and it is useful to interpret the cluster profiles in terms of a smaller set of components that represent the original variables well.

Usage

adproclus_low_dim(
  data,
  nclusters,
  ncomponents,
  start_allocation = NULL,
  nrandomstart = 3,
  nsemirandomstart = 3,
  save_all_starts = FALSE,
  seed = NULL
)

Arguments

data

Object-by-variable data matrix of class matrix or data.frame.

nclusters

Number of clusters to be used. Must be a positive integer.

ncomponents

Number of components (dimensions) to which the profiles should be restricted. Must be a positive integer.

start_allocation

Optional binary matrix used as the starting allocation for the first run. Default is NULL.

nrandomstart

Number of random starts (see get_random). Can be zero. Increasing it tends to give better results at the cost of longer computation time. Some research finds 500 starts to be a useful reference.

nsemirandomstart

Number of semi-random starts (see get_semirandom). Can be zero. Increasing it tends to give better results at the cost of longer computation time. Some research finds 500 starts to be a useful reference.

save_all_starts

logical. If TRUE, the results of all algorithm starts are returned. By default, only the best solution is retained.

seed

Integer. Seed for the random number generator. Default: NULL, meaning the results are not reproducible.
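The effect of the seed argument can be checked with a small sketch like the one below. It assumes the adproclus package is installed and reuses the stackloss data from the Examples section; the seed value 42 and the object names run_a, run_b and same_fit are arbitrary choices for illustration.

```r
if (requireNamespace("adproclus", quietly = TRUE)) {
  x <- stackloss

  # Two runs with the same seed and otherwise default settings
  run_a <- adproclus::adproclus_low_dim(x, 3, ncomponents = 1, seed = 42)
  run_b <- adproclus::adproclus_low_dim(x, 3, ncomponents = 1, seed = 42)

  # With a fixed seed, both runs are expected to reach the same loss value
  same_fit <- isTRUE(all.equal(run_a$sse, run_b$sse))
  same_fit
}
```

Leaving seed = NULL means the random and semi-random starts differ between calls, so repeated runs may return different solutions.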

Details

This function uses an extension by Depril et al. (2012) of Mirkin's (1987, 1990) additive profile clustering method to obtain a low dimensional overlapping clustering model of the object-by-variable data provided by data.

More precisely, the low dimensional ADPROCLUS model approximates an I \times J object-by-variable data matrix X by an I \times J model matrix M. For K overlapping clusters, M can be decomposed into an I \times K binary cluster membership matrix A and a K \times J real-valued cluster profile matrix P such that M = AP. With the simultaneous dimension reduction, P is restricted to be of reduced rank S < min(K,J), so that it can be decomposed into P = CB', with C a K \times S matrix and B a J \times S matrix. A row of C represents the profile values of the respective cluster in terms of the S components, while the entries of B can be used to interpret the components in terms of the complete set of variables.

The aim of a low dimensional ADPROCLUS analysis is therefore, given a number of clusters K and a number of dimensions S, to estimate a model matrix M that reconstructs the data matrix X as closely as possible in a least squares sense while simultaneously reducing the dimensions of the data. For a detailed illustration of the low dimensional ADPROCLUS model and the associated loss function, see Depril et al. (2012).
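The decomposition described above can be made concrete with a small base R sketch. It does not call the package or run the estimation algorithm; it only builds M = AP with P = CB' from hand-picked toy matrices. All numeric values, the dimensions (I = 6, J = 4, K = 3, S = 1) and the simulated data matrix X are assumptions made for illustration.

```r
# Toy dimensions (all values below are made up for illustration)
I <- 6; J <- 4; K <- 3; S <- 1

# Binary membership matrix A (I x K): an object may belong to several
# clusters (row 2, row 4) or to none (row 6)
A <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 1,
              0, 0, 1,
              0, 0, 0), nrow = I, ncol = K, byrow = TRUE)

# Low dimensional profiles C (K x S) and base vectors B (J x S)
C <- matrix(c(2, -1, 0.5), nrow = K, ncol = S)
B <- matrix(c(1, 0.5, -0.5, 2), nrow = J, ncol = S)

# Reduced-rank profile matrix P = C B' (K x J) and model matrix M = A P (I x J)
P <- C %*% t(B)
M <- A %*% P

# Low dimensional representation of the objects, A C (I x S)
M_lowdim <- A %*% C

# Least squares loss for a data matrix X of the same size as M
# (here X is simply M plus a little simulated noise)
set.seed(1)
X <- M + matrix(rnorm(I * J, sd = 0.1), nrow = I, ncol = J)
sse <- sum((X - M)^2)
```

Because the model is additive, the model row of an object in several clusters is the sum of the corresponding profiles: M[2, ] equals P[1, ] + P[2, ] in this toy example.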

Warning: Computation time increases exponentially with the number of clusters, K. We recommend determining the computation time of a single start for each specific dataset and K before increasing the number of starts.
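One possible way to follow this recommendation is sketched below. It assumes the adproclus package is installed, uses the stackloss data from the Examples section, and reads the documented timer_one_run element. The names single, planned_starts and rough_total are hypothetical, and multiplying by the number of starts gives only a rough estimate, since individual starts can take different numbers of iterations.

```r
if (requireNamespace("adproclus", quietly = TRUE)) {
  x <- stackloss

  # A single random start and no semi-random starts, to gauge the cost of one run
  single <- adproclus::adproclus_low_dim(x, nclusters = 3, ncomponents = 1,
                                         nrandomstart = 1, nsemirandomstart = 0,
                                         seed = 1)

  # Time in seconds for that start, as documented under Value
  single$timer_one_run

  # Rough extrapolation to a larger number of starts (an approximation only)
  planned_starts <- 500
  rough_total <- planned_starts * single$timer_one_run
  rough_total
}
```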

Value

adproclus_low_dim() returns a list with the following components, which describe the best model (from the multiple starts):

model

matrix. The obtained overlapping clustering model M of the same size as data.

model_lowdim

matrix. The obtained low dimensional clustering model AC of size I \times S.

A

matrix. The membership matrix A of the clustering model. Clusters are sorted by size.

P

matrix. The profile matrix P of the clustering model.

c

matrix. The profile values in terms of the low dimensional components.

B

matrix. Variables-by-components matrix of base vectors connecting the low dimensional components with the original variables. Warning: to compute P, use the transpose B'.

sse

numeric. The residual sum of squares of the clustering model, which is minimized by the ALS algorithm.

totvar

numeric. The total sum of squares of data.

explvar

numeric. The proportion of variance in data that is accounted for by the clustering model.

iterations

numeric. The number of iterations of the algorithm.

timer

numeric. The amount of time (in seconds) the complete algorithm ran for.

timer_one_run

numeric. The amount of time (in seconds) the relevant single start ran for.

initial_start

list. A list containing the initial membership matrix, as well as the type of start that was used to obtain the clustering solution (as returned by get_random or get_semirandom).

runs

list. Each element represents one model obtained from one of the multiple starts. Each element contains all of the above information.

parameters

list. The parameters used for the model.
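As a brief illustration of how the returned list might be inspected, the sketch below fits the model from the Examples section and looks at a few of the elements documented above. It assumes the adproclus package is installed; the seed value is arbitrary, and the comparison of model with A %*% P relies on the relation M = AP stated under Details.

```r
if (requireNamespace("adproclus", quietly = TRUE)) {
  x <- stackloss
  clust <- adproclus::adproclus_low_dim(x, 3, ncomponents = 1, seed = 1)

  # The model matrix has the same size as the data
  dim(clust$model)
  dim(x)

  # Cluster sizes: column sums of the binary membership matrix A
  colSums(clust$A)

  # Fit measures
  clust$sse      # residual sum of squares
  clust$totvar   # total sum of squares of the data
  clust$explvar  # proportion of variance accounted for

  # Compare the model matrix with A P (dimnames are ignored in the comparison)
  all.equal(clust$model, clust$A %*% clust$P, check.attributes = FALSE)
}
```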

References

Depril, D., Van Mechelen, I., & Wilderjans, T. F. (2012). Lowdimensional additive overlapping clustering. Journal of Classification, 29, 297-320.

See Also

adproclus

for full dimensional ADPROCLUS

get_random

for generating random starts

get_semirandom

for generating semi-random starts

get_rational

for generating rational starts

Examples

# Loading a test dataset into the global environment
x <- stackloss

# Low dimensional clustering with K = 3 clusters
# where the resulting profiles can be characterized in S = 1 dimension
clust <- adproclus_low_dim(x, 3, ncomponents = 1)
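A further call, sketched here under the assumption that the adproclus package is installed, keeps the results of every start by setting save_all_starts = TRUE. The start counts, the seed and the name clust_all are arbitrary illustrative choices.

```r
if (requireNamespace("adproclus", quietly = TRUE)) {
  x <- stackloss

  # Keep the results of every start instead of only the best solution
  clust_all <- adproclus::adproclus_low_dim(x, 3, ncomponents = 1,
                                            nrandomstart = 2,
                                            nsemirandomstart = 2,
                                            save_all_starts = TRUE,
                                            seed = 1)

  # Each element of runs holds the model obtained from one start
  length(clust_all$runs)
}
```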


[Package adproclus version 1.0.2 Index]