find_defining_features {mixdir} | R Documentation |
Find the n defining features
Description
Reduce the dimensionality of a dataset by calculating how important each feature is for inferring the clustering.
Usage
find_defining_features(mixdir_obj, X, n_features = Inf,
measure = c("JS", "ARI"), subsample_size = Inf, step_size = Inf,
exponential_decay = TRUE, verbose = FALSE)
Arguments
mixdir_obj |
the result from a call to |
X |
the original dataset that was used for clustering. |
n_features |
the number of dimensions that should be selected. If it is
|
measure |
The measure used to assess the loss of clustering quality
if a variable is removed. Two measures are implemented: "JS", short for
Jensen-Shannon divergence, compares the original class probabilities
with the new predicted class probabilities (smaller is better);
"ARI", short for adjusted Rand index, compares the overlap of the original
and the predicted classes (requires the |
subsample_size |
Running this method on the full dataset can be slow. Randomly selecting a subset of rows from X speeds up the calculation, usually without disproportionately hurting the selection performance. |
step_size |
The method can either remove each feature individually
and return the n features that caused the greatest quality loss
( |
exponential_decay |
Boolean or number. Alternative way of
calculating how many features to remove each step. The default is
to always remove the least important 50% of the features
( |
verbose |
Boolean indicating if status messages should be printed. |
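The "JS" measure described under the measure argument can be illustrated with a small base R sketch. The helper js_divergence and the example probabilities below are hypothetical and are not part of the mixdir API; they only show why a smaller value means the reduced feature set reproduces the original class probabilities more closely.

```r
# Jensen-Shannon divergence between two discrete distributions p and q.
# Illustrative helper, not part of the mixdir API.
js_divergence <- function(p, q) {
  m <- (p + q) / 2
  # Kullback-Leibler divergence, treating 0 * log(0) as 0
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2
}

# Made-up class probabilities for one observation over three classes
original <- c(0.7, 0.2, 0.1)
reduced  <- c(0.6, 0.3, 0.1)

js_same <- js_divergence(original, original)  # identical distributions
js_diff <- js_divergence(original, reduced)   # slightly shifted distribution
```

Identical distributions give a divergence of zero, and the value grows as the predicted class probabilities drift away from the original ones. With the natural logarithm used here, it never exceeds log(2).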
Details
Iteratively find the variable whose removal least affects the
clustering compared with the original. If n_features is a finite number,
the quality is a single number that reflects how well those n features maintain
the original clustering. If n_features=Inf, the method returns all features
ordered by decreasing importance. The accompanying quality vector contains the
"cumulative" loss incurred when the corresponding variable is removed.
Note that the quality can differ depending on the step size scheme. For example,
if all variables are removed in one step (step_size=Inf and
exponential_decay=FALSE), the quality is not cumulative, but simply the
quality of the clustering excluding the corresponding feature. In that
sense the quality vector should not be taken as a definitive answer, but
only as guidance for spotting jumps in the quality.
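The difference between cumulative and non-cumulative quality can be seen in a toy backward elimination written in base R. This is a minimal sketch under stated assumptions, not the package's implementation: rank_features, toy_loss and the info vector are hypothetical, and the loss function stands in for the clustering quality measure (smaller is better).

```r
# Greedy backward elimination. `loss_fn` is a hypothetical stand-in for the
# clustering quality measure: it takes the names of the kept features and
# returns a number (smaller is better). `step_size` is the number of
# features dropped per round.
rank_features <- function(features, loss_fn, step_size = 1) {
  kept <- features
  removed <- character(0)
  quality <- numeric(0)
  while (length(kept) > 0) {
    # Loss after dropping each remaining feature on its own
    losses <- vapply(kept, function(f) loss_fn(setdiff(kept, f)), numeric(1))
    # Drop the least important features of this round, least important first
    n_drop <- min(step_size, length(kept))
    dropped <- sort(losses)[seq_len(n_drop)]
    removed <- c(removed, names(dropped))
    quality <- c(quality, unname(dropped))
    kept <- setdiff(kept, names(dropped))
  }
  # Features removed last are the most important, so reverse the order
  list(features = rev(removed), quality = rev(quality))
}

# Toy loss: each feature carries a fixed, made-up amount of information,
# and the loss is the total information of the features that are missing
info <- c(a = 5, b = 1, c = 3)
toy_loss <- function(kept) sum(info) - sum(info[kept])

one_by_one  <- rank_features(names(info), toy_loss, step_size = 1)
all_at_once <- rank_features(names(info), toy_loss, step_size = Inf)
```

Both calls rank the features as a, c, b, but the quality vectors differ. Removing one feature per round yields the cumulative losses 9, 4, 1, whereas removing everything in a single round yields the individual losses 5, 3, 1.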
See Also
find_predictive_features
find_typical_features
Examples
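data("mushroom")
# Fit a mixdir model with n_latent=20 on the first 100 rows
res <- mixdir(mushroom[1:100, ], n_latent=20)
# Select the 3 most defining features
find_defining_features(res, mushroom[1:100, ], n_features=3)
# Return all features ordered by decreasing importance
find_defining_features(res, mushroom[1:100, ], n_features=Inf)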