stabilitySechidis {stabm}    R Documentation
Stability Measure Sechidis
Description
The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied on all of the datasets and the sets of chosen features are recorded. The stability of the feature selection is assessed based on the sets of chosen features using stability measures.
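The resampling workflow described above can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated, `select_top_k` is a hypothetical correlation filter standing in for the feature selection method of interest, and the subsample fraction of 0.8 is an arbitrary choice. None of these names are part of stabm.

```r
set.seed(1)
n = 100
p = 10
X = matrix(rnorm(n * p), nrow = n)   # simulated data, for illustration only
y = X[, 1] + X[, 2] + rnorm(n)

# Hypothetical filter: keep the k features most correlated with y
select_top_k = function(X, y, k = 3) {
  scores = abs(cor(X, y))[, 1]
  sort(order(scores, decreasing = TRUE)[seq_len(k)])
}

# Draw m subsamples of the rows and record the chosen feature indices
m = 5
feats = lapply(seq_len(m), function(i) {
  idx = sample(n, size = floor(0.8 * n))
  select_top_k(X[idx, , drop = FALSE], y[idx])
})
```

The resulting list `feats` has one entry per subsample and refers to the same p features throughout, which is the form expected by the `features` argument.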
Usage
stabilitySechidis(features, sim.mat, threshold = 0.9, impute.na = NULL)
Arguments
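features | List of length m. The i-th entry contains the features chosen for the i-th dataset (see Notation).
sim.mat | Similarity matrix of the p features, a (p \times p)-matrix from which the matrix C is derived (see Details).
threshold | Threshold applied to sim.mat: all values of sim.mat that are smaller than threshold are set to 0 (see Details). Defaults to 0.9.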
impute.na |
|
Details
The stability measure is defined as
1 - \frac{\mathop{\mathrm{trace}}(CS)}{\mathop{\mathrm{trace}}(C \Sigma)}
with (p \times p)-matrices
(S)_{ij} = \frac{m}{m-1}\left(\frac{h_{ij}}{m} - \frac{h_i}{m} \frac{h_j}{m}\right)
and
(\Sigma)_{ii} = \frac{q}{mp} \left(1 - \frac{q}{mp}\right),
(\Sigma)_{ij} = \frac{\frac{1}{m} \sum_{k=1}^{m} |V_k|^2 - \frac{q}{m}}{p^2 - p} - \frac{q^2}{m^2 p^2}, i \neq j.
The matrix C is created from matrix sim.mat by setting all values of sim.mat that are smaller than threshold to 0. If you want C to be equal to sim.mat, use threshold = 0.
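To make the formula concrete, the following base R sketch recomputes it step by step. It is not the stabm implementation: `sechidis_sketch` is a hypothetical helper, it assumes that the entries of `features` are integer indices matching the ordering of `sim.mat`, and it does no input checking or NA handling.

```r
# Hypothetical helper, not part of stabm: follows the formula in Details
sechidis_sketch = function(features, sim.mat, threshold = 0.9) {
  m = length(features)   # number of datasets
  p = nrow(sim.mat)      # total number of features

  # Z[i, j] = 1 if feature j was chosen on dataset i (integer indices assumed)
  Z = matrix(0, nrow = m, ncol = p)
  for (i in seq_len(m)) Z[i, features[[i]]] = 1

  H = crossprod(Z)       # H[i, j] = h_ij, and diag(H) = h_i
  h = diag(H)
  q = sum(h)
  sizes = rowSums(Z)     # set sizes |V_1|, ..., |V_m|

  # (S)_ij = m / (m - 1) * (h_ij / m - h_i / m * h_j / m)
  S = m / (m - 1) * (H / m - tcrossprod(h / m))

  # Sigma has one constant on the diagonal and another off the diagonal
  off = (mean(sizes^2) - q / m) / (p^2 - p) - q^2 / (m^2 * p^2)
  Sigma = matrix(off, nrow = p, ncol = p)
  diag(Sigma) = q / (m * p) * (1 - q / (m * p))

  # C: values of sim.mat below the threshold are set to 0
  C = sim.mat
  C[C < threshold] = 0

  1 - sum(diag(C %*% S)) / sum(diag(C %*% Sigma))
}

feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
res = sechidis_sketch(feats, mat)
```

When all feature sets are identical, every entry of S is 0, so the sketch returns 1 as long as the denominator is non-zero.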
Value
numeric(1)
Stability value.
Notation
For the definition of all stability measures in this package, the following notation is used:
Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is the set given by the i-th entry of features.
Furthermore, let h_j denote the number of sets that contain feature X_j, so that h_j is the absolute frequency with which feature X_j is chosen.
Analogously, let h_{ij} denote the number of sets that include both X_i and X_j.
Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i, where p denotes the total number of features.
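As a small illustration of this notation, the quantities can be computed in base R for a toy input. The value p = 10 and the variable names are assumptions made for this sketch only, and the feature sets are assumed to be given as integer indices.

```r
feats = list(1:3, 1:4, 1:5)   # V_1, V_2, V_3
m = length(feats)             # number of datasets
p = 10                        # assumed total number of features

# Z[i, j] = 1 if feature X_j is contained in V_i
Z = t(sapply(feats, function(v) as.integer(seq_len(p) %in% v)))

h = colSums(Z)                # h_j: how often X_j is chosen
H = crossprod(Z)              # H[i, j] = h_ij: how often X_i and X_j are chosen together
q = sum(h)                    # equals the sum of the set sizes |V_i|
V = Reduce(union, feats)      # union of all chosen feature sets
```

Here features 1 to 3 are chosen in all three sets, feature 4 in two sets and feature 5 in one, so q equals 3 + 4 + 5.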
Note
This stability measure is not corrected for chance. Unlike for the other stability measures in this R package that are not corrected for chance, no correction.for.chance can be applied for stabilitySechidis. This is because no finite upper bound is known for stabilitySechidis at the moment, see listStabilityMeasures.
References
Sechidis K, Papangelou K, Nogueira S, Weatherall J, Brown G (2020). “On the Stability of Feature Selection in the Presence of Feature Correlations.” In Machine Learning and Knowledge Discovery in Databases, 327–342. Springer International Publishing. doi:10.1007/978-3-030-46150-8_20.
Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.
See Also
listStabilityMeasures
Examples
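# Features chosen on three datasets, given as indices
feats = list(1:3, 1:4, 1:5)
# Similarity of 10 features, decaying with the distance between their indices
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilitySechidis(features = feats, sim.mat = mat)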