SMART {FACT}    R Documentation

SMART - Scoring Metric after Permutation

Description

SMART estimates the importance of a feature to the clustering algorithm by permuting the selected feature and measuring, via a scoring function, how much the cluster assignments change. Cluster-specific SMART indicates the importance for specific clusters versus the remaining ones, measured by a binary scoring metric. Global SMART assigns importance scores across all clusters, measured by a multi-class scoring metric. Currently, SMART can only be used for hard-label predictors.

Details

Let M \in \mathbb{N}_0^{k \times k} denote the multi-cluster confusion matrix and M_c \in \mathbb{N}_0^{2 \times 2} the binary confusion matrix for cluster c versus the remaining clusters. SMART for feature set S corresponds to:

\text{Multi-cluster scoring:} \quad \text{SMART}(X, \tilde{X}_S) = h_{\text{multi}}(M) \\ \text{Binary scoring:} \quad \text{SMART}(X, \tilde{X}_S) = \text{AVE}(h_{\text{binary}}(M_1), \dots, h_{\text{binary}}(M_k))

where \text{AVE} averages a vector of binary scores, e.g., via micro or macro averaging. In order to reduce variance in the estimate from shuffling the data, one can shuffle t times and evaluate the distribution of scores. Let \tilde{X}_S^{(t)} denote the t-th shuffling iteration for feature set S. The SMART point estimate is given by:

\overline{\text{SMART}}(X, \tilde{X}_S) = \psi\left(\text{SMART}(X, \tilde{X}_S^{(1)}), \dots, \text{SMART}(X, \tilde{X}_S^{(t)})\right)

where \psi extracts a sample statistic such as the mean, the median, or a quantile.
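The procedure above can be sketched in base R. This is only an illustration under stated assumptions, not FACT's internal implementation: the clustering is a plain kmeans fit on the iris measurements, and predict_cluster, macro_f1 and smart_sketch are hypothetical helper names introduced here. The sketch uses binary F1 per cluster with macro averaging and the median as \psi.

```r
# Sketch only: base R, not the FACT implementation.
set.seed(1)
X <- iris[, 1:4]
k <- 3
km <- kmeans(X, centers = k, nstart = 10)

# Hypothetical hard-label predictor: assign each row to the nearest center.
predict_cluster <- function(centers, newdata) {
  newdata <- as.matrix(newdata)
  d <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(newdata, 2, centers[j, ])^2)
  })
  max.col(-d, ties.method = "first")
}

# Macro-averaged F1 from the k x k confusion matrix M
# (rows: original labels, columns: labels after permutation).
macro_f1 <- function(orig, perm, k) {
  M <- table(factor(orig, levels = seq_len(k)),
             factor(perm, levels = seq_len(k)))
  f1 <- sapply(seq_len(k), function(cl) {
    tp <- M[cl, cl]
    fp <- sum(M[, cl]) - tp
    fn <- sum(M[cl, ]) - tp
    if (2 * tp + fp + fn == 0) NA_real_ else 2 * tp / (2 * tp + fp + fn)
  })
  mean(f1, na.rm = TRUE)
}

orig <- predict_cluster(km$centers, X)

# Permute one feature t times, score each permutation, summarise with psi.
smart_sketch <- function(feature, t = 20, psi = median) {
  scores <- replicate(t, {
    X_perm <- X
    X_perm[[feature]] <- sample(X_perm[[feature]])
    macro_f1(orig, predict_cluster(km$centers, X_perm), k)
  })
  psi(scores)
}

res <- sapply(names(X), smart_sketch)
res
```

A score close to 1 means that permuting the feature barely changes the cluster assignments; lower scores point to more important features.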

Public fields

avg

(character(1) or NULL)
NULL calculates cluster-specific (binary) metrics. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance.

metric

character(1)
The binary similarity metric used.

predictor

ClustPredictor
The object (created with ClustPredictor$new()) holding the cluster algorithm and the data.

data.sample

data.frame
The data, including features and soft/hard cluster labels.

sampler

any
Sampler from the predictor object.

features

(character or list)
Features or feature sets for which importance scores are calculated.

n.repetitions

(numeric(1))
How often is the shuffling of the feature repeated?

results

(data.table)
A data.table containing the results from the SMART procedure.

Methods

Public methods


Method new()

Create a SMART object

Usage
SMART$new(
  predictor,
  features = NULL,
  metric = "f1",
  avg = NULL,
  n.repetitions = 5
)
Arguments
predictor

ClustPredictor
The object (created with ClustPredictor$new()) holding the cluster algorithm and the data.

features

(character or list)
Features for which importance scores are calculated. The default value of NULL implies all features. Use a named list of character vectors to define groups of features for which joint importance will be calculated.

metric

character(1)
The binary similarity metric used. Defaults to "f1", i.e. the F1 score. Other possible binary scores are "precision", "recall", "jaccard", "folkes_mallows" and "accuracy".

avg

(character(1) or NULL)
Either NULL, "micro" or "macro". Defaults to NULL, which calculates cluster-specific (binary) metrics. "micro" summarizes binary scores to a global score that treats each instance in the data set with equal importance. "macro" summarizes binary scores to a global score that treats each cluster with equal importance. For unbalanced clusters, "macro" is recommended.

n.repetitions

(numeric(1))
How often should the shuffling of the feature be repeated? The higher the number of repetitions, the more stable and accurate the results become.
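The difference between the "micro" and "macro" settings of avg can be illustrated with a small hand-made confusion matrix. The sketch below uses the standard textbook definitions of micro- and macro-averaged F1; the counts are invented for illustration, and FACT's internal computation may differ in detail.

```r
# Illustrative counts only: one large cluster (100 points), one small (10).
# Rows: original cluster, columns: cluster after permutation.
M <- matrix(c(95, 5,
               8, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(original = c("c1", "c2"),
                            permuted = c("c1", "c2")))

tp <- diag(M)            # instances that keep their cluster
fp <- colSums(M) - tp    # instances wrongly moved into the cluster
fn <- rowSums(M) - tp    # instances that left the cluster

# Binary F1 for each cluster versus the rest.
f1_per_cluster <- 2 * tp / (2 * tp + fp + fn)

# Macro: every cluster counts equally.
macro <- mean(f1_per_cluster)

# Micro: pool the counts, so every instance counts equally.
micro <- 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))

f1_per_cluster
c(macro = macro, micro = micro)
```

Here the small cluster is almost entirely reassigned, which pulls the macro score down sharply, while the micro score stays high because the large cluster dominates the pooled counts.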

Returns

(data.frame)
data.frame with the results of the feature importance computation. One row per feature with the following columns: For global scores:


Method print()

Print a SMART object

Usage
SMART$print()
Returns

character
Information about the predictor, data, metric, and avg, plus the head of the results.


Method plot()

Plots the similarity scores of a SMART object.

Usage
SMART$plot(log = FALSE, single_cl = NULL)
Arguments
log

logical(1)
Indicates whether the results should be log-transformed. This can be useful to distinguish the importance if similarity scores are all close to 1.

single_cl

character(1)
Only used for cluster-specific scores (avg = NULL). Should match one of the cluster names. In this case, importance scores for a single cluster are plotted.

Details

The plot shows the similarity per feature. For global scores: when n.repetitions in SMART$new was larger than 1, there are multiple similarity estimates per feature. These are aggregated, and the plot shows the median similarity per feature (as dots) together with the 90%-quantile, which helps to assess how much the estimate varies per feature. For cluster-specific scores: the similarity estimates of all clusters are stacked per feature. This can be used to obtain a global estimate as a sum of cluster-wise similarities.
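The aggregation of repeated global scores described above can be sketched in base R. The scores below are random placeholder numbers for two made-up features, not SMART output, and the sketch assumes the quantile band spans the 5% to 95% quantiles, which is only one possible reading of "90%-quantile".

```r
set.seed(42)
# Placeholder scores: 10 repetitions for each of two hypothetical features.
scores <- data.frame(
  feature = rep(c("x", "y"), each = 10),
  similarity = c(runif(10, 0.4, 0.6), runif(10, 0.85, 0.95))
)

# Per feature: median (the dot) and an assumed 5%-95% band (the range).
summary_stats <- do.call(rbind, lapply(split(scores, scores$feature), function(d) {
  data.frame(
    feature = d$feature[1],
    median = median(d$similarity),
    q05 = unname(quantile(d$similarity, 0.05)),
    q95 = unname(quantile(d$similarity, 0.95))
  )
}))
summary_stats
```

A wide band relative to the differences between features suggests increasing n.repetitions.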

Returns

ggplot2 plot object


Method clone()

The objects of this class are cloneable with this method.

Usage
SMART$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

See Also

iml::FeatureImp

Examples


# load data and packages
require(FACT)
require(factoextra)
require(FuzzyDBScan)
multishapes = as.data.frame(multishapes[, 1:2])
# Set up and train FuzzyDBScan
eps = c(0, 0.2)
pts = c(3, 15)
res = FuzzyDBScan$new(multishapes, eps, pts)
res$plot("x", "y")
# create hard label predictor
predict_part = function(model, newdata) model$predict(new_data = newdata, cmatrix = FALSE)$cluster
predictor = ClustPredictor$new(res, as.data.frame(multishapes), y = res$clusters,
                               predict.function = predict_part, type = "partition")
# Run SMART globally
macro_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1", avg = "macro")
macro_f1 # print global SMART
macro_f1$plot(log = TRUE) # plot global SMART
# Run cluster-specific SMART
classwise_f1 = SMART$new(predictor, n.repetitions = 50, metric = "f1")
classwise_f1 # print cluster-specific SMART
classwise_f1$plot(log = TRUE) # plot cluster-specific SMART


[Package FACT version 0.1.1 Index]