stabilityYu {stabm} | R Documentation |
Stability Measure Yu
Description
The stability of feature selection is defined as the robustness of
the sets of selected features with respect to small variations in the data on which the
feature selection is conducted. To quantify stability, several datasets from the
same data generating process can be used. Alternatively, a single dataset can be
split into parts by resampling. Either way, all datasets used for feature selection must
contain exactly the same features. The feature selection method of interest is
applied to each of the datasets and the sets of chosen features are recorded.
The stability of the feature selection is assessed based on the sets of chosen features
using stability measures.
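The resampling workflow described above can be sketched as follows. This is only an illustration under stated assumptions: the simulated data, the helper select_top_k (a hypothetical correlation filter) and the choice of 5 subsamples of 80% of the rows are not part of the package.

```r
set.seed(1)
n <- 200
p <- 10

# Simulated data (assumption): only the first two features influence y
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
colnames(X) <- paste0("X", seq_len(p))
y <- X[, 1] + X[, 2] + rnorm(n)

# Hypothetical filter: keep the k features with the largest absolute
# correlation with the target
select_top_k <- function(X, y, k = 3) {
  scores <- abs(cor(X, y))[, 1]
  names(sort(scores, decreasing = TRUE))[seq_len(k)]
}

# Draw 5 subsamples of the rows and record the chosen features for each;
# every subsample contains exactly the same set of columns
feats <- lapply(1:5, function(b) {
  idx <- sample(n, size = floor(0.8 * n))
  select_top_k(X[idx, , drop = FALSE], y[idx])
})
```

The resulting list feats has one element per subsample and has the shape expected by the features argument.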
Usage
stabilityYu(
features,
sim.mat,
threshold = 0.9,
correction.for.chance = "estimate",
N = 10000,
impute.na = NULL
)
Arguments
features |
list (length >= 2)
Chosen features per dataset. Each element of the list contains the features for one dataset.
The features must be given by their names (character) or indices (integerish).
|
sim.mat |
numeric matrix
Similarity matrix which contains the similarity structure of all features based on
all datasets. The similarity values must be in the range of [0, 1] where 0 indicates
very low similarity and 1 indicates very high similarity. If the list elements of
features are integerish vectors, then the feature numbering must correspond to the
ordering of sim.mat. If the list elements of features are character
vectors, then sim.mat must be named and the names of sim.mat must correspond
to the entries in features.
|
threshold |
numeric(1)
Threshold for indicating which features are similar and which are not. Two features
are considered similar if and only if the corresponding entry of sim.mat is greater
than or equal to threshold.
|
correction.for.chance |
character(1)
How should the expected value of the stability score (see Details)
be assessed? Options are "estimate", "exact" and "none".
For "estimate", N random feature sets of the same sizes as the input feature
sets (features) are generated.
For "exact", all possible combinations of feature sets of the same
sizes as the input feature sets are used. Computation is only feasible for very
small numbers of features and numbers of considered datasets (length(features)).
For "none", the transformation (score - expected) / (maximum - expected)
is not conducted, i.e. only score is used.
This is not recommended.
|
N |
numeric(1)
Number of random feature sets to consider. Only relevant if correction.for.chance
is set to "estimate".
|
impute.na |
numeric(1)
In some scenarios, the stability cannot be assessed based on all feature sets.
E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result.
With which value should these missing values be imputed? NULL means no imputation.
|
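The requirements on sim.mat and threshold, and the cost of "exact", can be illustrated with a short base R sketch. Using absolute Pearson correlations as similarity values is an assumption made here for illustration; any similarity with values in [0, 1] fits the description above. The count at the end is simply the number of ways to choose feature sets of the given sizes, shown to give a feeling for why "exact" is only feasible for small problems.

```r
set.seed(42)

# Toy data (assumption): column "b" is nearly a copy of column "a"
X <- matrix(rnorm(100 * 4), ncol = 4)
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)
colnames(X) <- c("a", "b", "c", "d")

# Named similarity matrix with values in [0, 1]; the row and column names
# match the feature names that would appear in `features`
sim.mat <- abs(cor(X))

# Two features count as similar if and only if their entry is >= threshold
threshold <- 0.9
similar <- sim.mat >= threshold

# Number of ways to pick feature sets of sizes 3, 4 and 5 out of p = 10 features
p <- 10
sizes <- c(3, 4, 5)
n.combinations <- prod(choose(p, sizes))
```

For the sizes used in the Examples section this gives 120 * 210 * 252 = 6350400 combinations, compared to the default of N = 10000 random draws for "estimate".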
Details
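Let O_{ij} denote the number of features in V_i that are not shared with V_j
but that have a highly similar feature in V_j:

O_{ij} = |\{x \in (V_i \setminus V_j) : \exists y \in (V_j \setminus V_i) \text{ with } Similarity(x, y) \geq threshold\}|.

Then the stability measure is defined as (see Notation)

\frac{2}{m(m-1)} \sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\frac{|V_i| + |V_j|}{2} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \frac{O_{ij} + O_{ji}}{2}.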
Note that this definition slightly differs from its original in order to make it suitable
for arbitrary datasets and similarity measures and applicable in situations with
|V_i| \neq |V_j|.
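To make the definitions concrete, the following base R sketch computes O_ij and I(V_i, V_j) for one pair of feature sets and applies the chance correction to that pair. It is not the package's implementation: the helpers count_O and pair_I, the two example sets, and the way the expected value is estimated (drawing random sets of the same sizes, uniformly and without replacement) are assumptions made for illustration.

```r
# Similarity matrix as in the Examples section: entry (i, j) is 0.92^|i - j|,
# so with threshold 0.9 only neighbouring features (|i - j| <= 1) are similar
sim.mat <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
threshold <- 0.9

# O_ij: number of features in Vi \ Vj that have at least one feature in
# Vj \ Vi with similarity >= threshold
count_O <- function(Vi, Vj, sim.mat, threshold) {
  only.i <- setdiff(Vi, Vj)
  only.j <- setdiff(Vj, Vi)
  if (length(only.i) == 0 || length(only.j) == 0) return(0)
  has.similar <- apply(sim.mat[only.i, only.j, drop = FALSE] >= threshold, 1, any)
  sum(has.similar)
}

# I(Vi, Vj) = |Vi intersect Vj| + (O_ij + O_ji) / 2
pair_I <- function(Vi, Vj, sim.mat, threshold) {
  length(intersect(Vi, Vj)) +
    (count_O(Vi, Vj, sim.mat, threshold) + count_O(Vj, Vi, sim.mat, threshold)) / 2
}

V1 <- c(1, 2, 3)
V2 <- c(1, 2, 4)

# Features 1 and 2 are shared; 3 and 4 are not shared but similar to each
# other, so O_12 = O_21 = 1
I12 <- pair_I(V1, V2, sim.mat, threshold)
upper <- (length(V1) + length(V2)) / 2

# Monte Carlo estimate of E(I) from random feature sets of the same sizes
set.seed(1)
N <- 1000
p <- ncol(sim.mat)
sim.I <- replicate(N, pair_I(sample(p, length(V1)), sample(p, length(V2)),
                             sim.mat, threshold))
expected <- mean(sim.I)

# Chance-corrected value for this single pair
corrected <- (I12 - expected) / (upper - expected)
```

Here I12 equals the upper bound (|V_1| + |V_2|) / 2 = 3, so the corrected value for this pair is 1 regardless of the estimated expectation. Replacing V2 by c(1, 2, 5) lowers I to 2, because features 3 and 5 have similarity 0.92^2 = 0.8464, which is below the threshold.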
Value
numeric(1)
Stability value.
Notation
For the definition of all stability measures in this package,
the following notation is used:
Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets,
i.e. features has length m and V_i is a set which contains the i-th entry of features.
Furthermore, let h_j denote the number of sets that contain feature X_j
so that h_j is the absolute frequency with which feature X_j is chosen.
Analogously, let h_{ij} denote the number of sets that include both X_i and X_j.
Also, let q = \sum_{j=1}^{p} h_j = \sum_{i=1}^{m} |V_i| and V = \bigcup_{i=1}^{m} V_i.
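As a small illustration of this notation, the quantities can be computed in base R for the feature sets used in the Examples section. The variable names (h, h.pair, inc) are chosen here for illustration and are not part of the package.

```r
features <- list(1:3, 1:4, 1:5)

m <- length(features)            # number of datasets
V <- Reduce(union, features)     # union of all chosen features

# h_j: number of sets that contain feature X_j (for the features in V)
h <- table(factor(unlist(features), levels = V))

# q: total number of chosen features, counted with multiplicity
q <- sum(lengths(features))

# h_ij: number of sets that contain both X_i and X_j, obtained from the
# |V| x m incidence matrix (entry 1 if the feature is in the set)
inc <- sapply(features, function(Vi) V %in% Vi) * 1
rownames(inc) <- V
h.pair <- inc %*% t(inc)
```

Features that are never chosen have h_j = 0, so summing h over V gives the same q as summing over all p features. The diagonal of h.pair reproduces h.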
References
Yu L, Han Y, Berens ME (2012).
“Stable Gene Selection from Microarray Data via Sample Weighting.”
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 262–272.
doi:10.1109/tcbb.2011.47.
Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z (2009).
“Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.”
Bioinformatics, 25(13), 1662–1668.
doi:10.1093/bioinformatics/btp295.
Bommert A (2020).
Integration of Feature Selection Stability in Model Fitting.
Ph.D. thesis, TU Dortmund University, Germany.
doi:10.17877/DE290R-21906.
See Also
listStabilityMeasures
Examples
feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
stabilityYu(features = feats, sim.mat = mat, N = 1000)
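A variant of the example above with features given by name may also be useful. It is a sketch that assumes the stabm package is installed (the call is skipped otherwise); the feature names X1 to X10 are made up for illustration. Setting a seed is intended to make the "estimate" correction repeatable, on the assumption that the random feature sets are drawn with R's random number generator.

```r
feat.names <- paste0("X", 1:10)

# Same similarity structure as above, but with row and column names so that
# character feature sets can be matched to sim.mat
mat.named <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
dimnames(mat.named) <- list(feat.names, feat.names)

feats.chr <- list(feat.names[1:3], feat.names[1:4], feat.names[1:5])

if (requireNamespace("stabm", quietly = TRUE)) {
  set.seed(1)
  stabm::stabilityYu(features = feats.chr, sim.mat = mat.named, N = 1000)
}
```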
[Package stabm version 1.2.2 Index]