R: Find the n defining features

find_defining_features {mixdir}

R Documentation

Find the n defining features

Description

Reduce the dimensionality of a dataset by calculating how important each feature is for inferring the clustering.

Usage

find_defining_features(mixdir_obj, X, n_features = Inf,
  measure = c("JS", "ARI"), subsample_size = Inf, step_size = Inf,
  exponential_decay = TRUE, verbose = FALSE)

Arguments

`mixdir_obj`	the result from a call to `mixdir()`. It needs to have the field `category_prob`: a list of lists of named vectors with the probabilities for each feature, latent class and possible category.
`X`	the original dataset that was used for clustering.
`n_features`	the number of dimensions that should be selected. If it is `Inf` (the default) all features are returned ordered by importance (most important first).
`measure`	The measure used to assess the loss of clustering quality when a variable is removed. Two measures are implemented: "JS", short for Jensen-Shannon divergence, compares the original class probabilities with the newly predicted class probabilities (smaller is better); "ARI", short for adjusted Rand index, compares the overlap of the original and the predicted classes (1 is perfect, 0 is as good as random; requires the `mcclust` package).
`subsample_size`	Running this method on the full dataset can be slow. The calculation can be sped up by randomly selecting a subset of rows from `X`, usually without disproportionately hurting the selection performance.
`step_size`	The method can either remove each feature individually and return the n features that caused the greatest quality loss (`step_size=Inf`), or iteratively remove the least important feature until the number of remaining features equals `n_features` (`step_size=1`). Using a smaller step size increases the sensitivity of the selection process, but takes longer to calculate.
`exponential_decay`	Boolean or number. Alternative way of calculating how many features to remove in each step. The default is to always remove the least important 50% of the features (`exponential_decay=2`).
`verbose`	Boolean indicating if status messages should be printed.
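
To make the "JS" option of `measure` more concrete, the following is a minimal base R sketch of the Jensen-Shannon divergence between two class probability vectors. The helper `js_divergence()` is hypothetical and not part of the package; the log base and how the package aggregates the divergence across observations are assumptions made here for illustration only.

```r
# Hypothetical helper, NOT the package's implementation.
# Jensen-Shannon divergence between two discrete distributions p and q
# (natural logarithm, so the value lies between 0 and log(2)).
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Class probabilities of one observation: original model vs. reduced model
original <- c(0.7, 0.2, 0.1)
reduced  <- c(0.6, 0.3, 0.1)

js_same     <- js_divergence(original, original)  # identical distributions
js_close    <- js_divergence(original, reduced)   # similar distributions
js_disjoint <- js_divergence(c(1, 0), c(0, 1))    # no overlap at all
```

Identical distributions give a divergence of 0, which is why smaller values indicate that the reduced feature set preserves the clustering better.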

Details

Iteratively find the variable whose removal least affects the clustering compared with the original. If `n_features` is a finite number, the quality is a single number that reflects how well those n features maintain the original clustering. If `n_features=Inf`, the method returns all features ordered by decreasing importance. The accompanying quality vector contains the "cumulative" loss incurred if the corresponding variable were removed. Note that the quality can differ depending on the step size scheme. For example, if all variables are removed in one step (`step_size=Inf` and `exponential_decay=FALSE`), the quality is not cumulative, but simply the quality of the clustering excluding the corresponding feature. In that sense, the quality vector should not be treated as a definitive answer, but only as guidance for spotting jumps in the quality.
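
The difference between the step size schemes can be illustrated with a small sketch that only counts how many features remain after each removal round. The function `removal_schedule()` and its `decay` argument are hypothetical and written for this illustration; how the package actually combines `step_size` and `exponential_decay` internally is an assumption here.

```r
# Hypothetical helper, NOT part of the package.
# Returns the number of features remaining after each removal round.
removal_schedule <- function(n_total, n_target, step_size = Inf, decay = FALSE) {
  remaining <- n_total
  sizes <- numeric(0)
  while (remaining > n_target) {
    if (!identical(decay, FALSE)) {
      # decay = 2 keeps roughly half of the features in each round,
      # but always removes at least one feature
      keep <- min(ceiling(remaining / decay), remaining - 1)
    } else {
      # remove a fixed number of features, never going below the target
      keep <- remaining - min(step_size, remaining - n_target)
    }
    remaining <- max(keep, n_target)
    sizes <- c(sizes, remaining)
  }
  sizes
}

one_by_one <- removal_schedule(22, 3, step_size = 1)    # 21, 20, ..., 3
all_at_once <- removal_schedule(22, 3, step_size = Inf) # a single round
halving <- removal_schedule(22, 3, decay = 2)           # 11, 6, 3
```

Removing one feature at a time needs 19 rounds in this example, halving needs 3, and removing everything at once needs a single round. This is the trade-off between sensitivity and running time described for `step_size`.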

Examples

  
  library(mixdir)
  data("mushroom")
  res <- mixdir(mushroom[1:100, ], n_latent=20)
  find_defining_features(res, mushroom[1:100, ], n_features=3)
  find_defining_features(res, mushroom[1:100, ], n_features=Inf)