R: Low dimensional ADPROCLUS

adproclus_low_dim {adproclus}

R Documentation

Low dimensional ADPROCLUS

Description

Perform low dimensional additive profile clustering (ADPROCLUS) on object-by-variable data. This is useful when the data to be clustered contain a large set of variables and the cluster profiles are easier to interpret in terms of a smaller set of components that represent the original variables well.

Usage

adproclus_low_dim(
  data,
  nclusters,
  ncomponents,
  start_allocation = NULL,
  nrandomstart = 3,
  nsemirandomstart = 3,
  save_all_starts = FALSE,
  seed = NULL
)

Arguments

`data`	Object-by-variable data matrix of class `matrix` or `data.frame`.
`nclusters`	Number of clusters to be used. Must be a positive integer.
`ncomponents`	Number of components (dimensions) to which the profiles should be restricted. Must be a positive integer.
`start_allocation`	Optional matrix of binary values as starting allocation for first run. Default is `NULL`.
`nrandomstart`	Number of random starts (see `get_random`). Can be zero. More starts tend to give better results at the cost of longer computation time. Some research finds 500 starts to be a useful reference.
`nsemirandomstart`	Number of semi-random starts (see `get_semirandom`). Can be zero. More starts tend to give better results at the cost of longer computation time. Some research finds 500 starts to be a useful reference.
`save_all_starts`	Logical. If `TRUE`, the results of all algorithm starts are returned. By default, only the best solution is retained.
`seed`	Integer. Seed for the random number generator. Default is `NULL`, meaning that results are not reproducible.
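As an illustration of `start_allocation` and `seed`, the sketch below builds a binary matrix by hand and passes it to the function. It assumes that the starting allocation has one row per object and one column per cluster (the same shape as the membership matrix A); this shape is an assumption, not something stated above. The helper `make_start_allocation` is hypothetical and not part of the package, and the package call only runs if `adproclus` is installed.

```r
# Hypothetical helper (not part of adproclus): random binary matrix with
# one row per object and one column per cluster (assumed shape).
make_start_allocation <- function(n_objects, n_clusters, seed = 1) {
  set.seed(seed)
  alloc <- matrix(
    as.numeric(rbinom(n_objects * n_clusters, size = 1, prob = 0.5)),
    nrow = n_objects, ncol = n_clusters
  )
  # Design choice of this sketch: refuse allocations with an empty cluster.
  stopifnot(all(colSums(alloc) > 0))
  alloc
}

x <- stackloss
start <- make_start_allocation(nrow(x), n_clusters = 3)

# Only call the package if it is available on this machine.
if (requireNamespace("adproclus", quietly = TRUE)) {
  fit <- adproclus::adproclus_low_dim(
    x, nclusters = 3, ncomponents = 1,
    start_allocation = start,  # used for the first run
    seed = 10                  # fixed seed for the random starts
  )
}
```

Fixing `seed` is what makes the random and semi-random starts repeatable between sessions; the hand-made allocation only affects the first run.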

Details

This function uses an extension by Depril et al. (2012) of Mirkin's (1987, 1990) additive profile clustering method to obtain a low dimensional overlapping clustering model of the object-by-variable data provided in `data`. More precisely, the low dimensional ADPROCLUS model approximates an I × J object-by-variable data matrix X by an I × J model matrix M. For K overlapping clusters, M can be decomposed into an I × K binary cluster membership matrix A and a K × J real-valued cluster profile matrix P such that M = AP. With the simultaneous dimension reduction, P is restricted to be of reduced rank S < min(K, J), so that it can be decomposed into P = CB', with C a K × S matrix and B a J × S matrix. A row of C then represents the profile values of the respective cluster in terms of the S components, while the entries of B can be used to interpret the components in terms of the complete set of variables. The aim of a low dimensional ADPROCLUS analysis is therefore, given a number of clusters K and a number of dimensions S, to estimate a model matrix M that reconstructs the data matrix X as closely as possible in a least squares sense while simultaneously reducing the dimensions of the data. For a detailed illustration of the low dimensional ADPROCLUS model and the associated loss function, see Depril et al. (2012).
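The decomposition described above can be made concrete with a small base R sketch. It does not call the package or run the estimation algorithm; it only constructs matrices A, C and B with toy sizes chosen for illustration, forms P = CB' and M = AP, and evaluates the least squares loss against noisy data. All object names and sizes are assumptions made for this example.

```r
set.seed(42)
n_obj <- 6; n_var <- 4; n_clust <- 3; n_comp <- 2  # I, J, K, S with S < min(K, J)

# Binary membership matrix A (I x K); rows with several 1s are overlaps.
A <- matrix(c(1, 0, 0,
              1, 1, 0,
              0, 1, 0,
              0, 1, 1,
              0, 0, 1,
              1, 0, 1), nrow = n_obj, byrow = TRUE)

C <- matrix(rnorm(n_clust * n_comp), n_clust, n_comp)  # K x S component profiles
B <- matrix(rnorm(n_var * n_comp), n_var, n_comp)      # J x S base vectors

P <- C %*% t(B)  # K x J profile matrix, rank at most S
M <- A %*% P     # I x J model matrix

# Toy data: the model plus a little noise.
X <- M + matrix(rnorm(n_obj * n_var, sd = 0.1), n_obj, n_var)

# Least squares loss: sum of squared residuals between data and model.
sse <- sum((X - M)^2)

qr(P)$rank  # numerical rank of P, bounded by n_comp
```

Because P is built as CB', its rank cannot exceed S, which is what allows the K cluster profiles to be read off in S dimensions via the rows of C.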

Warning: Computation time increases exponentially with the number of clusters, K. We recommend determining the computation time of a single start for each specific dataset and K before increasing the number of starts.
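One way to follow this recommendation is to time a single start for each candidate K before committing to many starts. The sketch below uses base R's `system.time()`; the helper `time_single_start` is hypothetical and not part of the package, and the call to `adproclus_low_dim()` only runs if the package is installed.

```r
# Hypothetical helper: elapsed seconds of one call to fit_fun(k) for each k.
time_single_start <- function(fit_fun, ks) {
  elapsed <- vapply(
    ks,
    function(k) system.time(fit_fun(k))[["elapsed"]],
    numeric(1)
  )
  data.frame(nclusters = ks, seconds = elapsed)
}

if (requireNamespace("adproclus", quietly = TRUE)) {
  # Exactly one random start and no semi-random starts per call.
  one_start <- function(k) {
    adproclus::adproclus_low_dim(
      stackloss, nclusters = k, ncomponents = 1,
      nrandomstart = 1, nsemirandomstart = 0, seed = 1
    )
  }
  print(time_single_start(one_start, ks = 2:3))
}
```

Multiplying the single-start time by the planned number of starts gives only a rough estimate of the total run time, since individual starts may need different numbers of iterations.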

Value

adproclus_low_dim() returns a list with the following components, which describe the best model (from the multiple starts):

model: matrix. The obtained overlapping clustering model M of the same size as data.
model_lowdim: matrix. The obtained low dimensional clustering model AC of size I × S.
A: matrix. The membership matrix A of the clustering model. Clusters are sorted by size.
P: matrix. The profile matrix P of the clustering model.
C: matrix. The profile values in terms of the low dimensional components.
B: matrix. Variables-by-components matrix of base vectors connecting the low dimensional components with the original variables. Warning: to compute P, use the transpose B'.
sse: numeric. The residual sum of squares of the clustering model, which is minimized by the ALS algorithm.
totvar: numeric. The total sum of squares of data.
explvar: numeric. The proportion of variance in data that is accounted for by the clustering model.
iterations: numeric. The number of iterations of the algorithm.
timer: numeric. The amount of time (in seconds) the complete algorithm ran for.
timer_one_run: numeric. The amount of time (in seconds) the relevant single start ran for.
initial_start: list. A list containing the initial membership matrix, as well as the type of start that was used to obtain the clustering solution (as returned by get_random or get_semirandom).
runs: list. Each element represents one model obtained from one of the multiple starts. Each element contains all of the above information.
parameters: list. Containing the parameters used for the model.
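The relations between the returned matrices can be checked mechanically. The sketch below defines a hypothetical helper, `check_lowdim_fit`, that tests P = CB', model = AP and model_lowdim = AC on any list carrying these components, and applies it to a hand-built toy list rather than to real package output. It assumes the components are named as in the list above and accepts either capitalization for the component profile matrix.

```r
# Hypothetical helper (not part of adproclus): verify the model equations
# on a list with components A, C (or c), B, P, model and model_lowdim.
check_lowdim_fit <- function(fit, tol = 1e-8) {
  C <- if (is.null(fit[["C"]])) fit[["c"]] else fit[["C"]]
  A <- fit[["A"]]; B <- fit[["B"]]; P <- fit[["P"]]
  same <- function(x, y) {
    # Compare values only, ignoring row and column names.
    isTRUE(all.equal(unname(x), unname(y), tolerance = tol))
  }
  c(
    P_equals_CBt     = same(P, C %*% t(B)),         # note the transpose of B
    model_equals_AP  = same(fit[["model"]], A %*% P),
    lowdim_equals_AC = same(fit[["model_lowdim"]], A %*% C)
  )
}

# Toy list built by hand with I = 3 objects, K = 2 clusters, S = 1, J = 3.
A <- matrix(c(1, 0,
              1, 1,
              0, 1), nrow = 3, byrow = TRUE)
C <- matrix(c(2, -1), nrow = 2)      # K x S
B <- matrix(c(1, 0.5, -1), nrow = 3) # J x S
toy <- list(A = A, C = C, B = B,
            P = C %*% t(B),
            model = A %*% C %*% t(B),
            model_lowdim = A %*% C)

check_lowdim_fit(toy)
```

For a fitted object such as `clust` from the Examples, the corresponding call would be `check_lowdim_fit(clust)`.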

References

Depril, D., Van Mechelen, I., & Wilderjans, T. F. (2012). Lowdimensional additive overlapping clustering. Journal of Classification, 29, 297-320.

Examples

# Loading a test dataset into the global environment
x <- stackloss

# Low dimensional clustering with K = 3 clusters
# where the resulting profiles can be characterized in S = 1 dimensions
clust <- adproclus_low_dim(x, 3, ncomponents = 1)

[Package adproclus version 1.0.2 Index]