R: Stability Measure Yu

stabilityYu {stabm}

R Documentation

Stability Measure Yu

Description

The stability of feature selection is defined as the robustness of the sets of selected features with respect to small variations in the data on which the feature selection is conducted. To quantify stability, several datasets from the same data generating process can be used. Alternatively, a single dataset can be split into parts by resampling. Either way, all datasets used for feature selection must contain exactly the same features. The feature selection method of interest is applied to each dataset and the sets of chosen features are recorded. The stability of the feature selection is then assessed from these sets using stability measures.
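The resampling workflow described above can be sketched in base R. Everything in this sketch is an assumption made for illustration: the simulated data, the hypothetical selector `select.top` (which keeps the features most correlated with the response), the choice of 5 subsamples of 80% of the observations, and the use of absolute correlations as similarity values.

```r
set.seed(1)
n <- 100
p <- 10
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + X[, 2] + rnorm(n)

# Hypothetical selector: keep the k features most correlated with y
select.top <- function(X, y, k = 3) {
  order(abs(cor(X, y)), decreasing = TRUE)[1:k]
}

# Apply the selector to 5 subsamples, each containing 80% of the observations
features <- lapply(1:5, function(b) {
  idx <- sample(n, size = 0.8 * n)
  select.top(X[idx, ], y[idx])
})

# Similarity matrix on the full data: absolute correlations lie in [0, 1]
sim.mat <- abs(cor(X))
```

The resulting `features` list and `sim.mat` have the shapes expected by the `features` and `sim.mat` arguments documented below.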

Usage

stabilityYu(
  features,
  sim.mat,
  threshold = 0.9,
  correction.for.chance = "estimate",
  N = 10000,
  impute.na = NULL
)

Arguments

`features`	`list (length >= 2)` Chosen features per dataset. Each element of the list contains the features for one dataset. The features must be given by their names (`character`) or indices (`integerish`).
`sim.mat`	`numeric matrix` Similarity matrix which contains the similarity structure of all features based on all datasets. The similarity values must be in the range [0, 1], where 0 indicates very low similarity and 1 indicates very high similarity. If the list elements of `features` are integerish vectors, then the feature numbering must correspond to the ordering of `sim.mat`. If the list elements of `features` are character vectors, then `sim.mat` must be named and the names of `sim.mat` must correspond to the entries in `features`.
`threshold`	`numeric(1)` Threshold for indicating which features are similar and which are not. Two features are considered similar if and only if the corresponding entry of `sim.mat` is greater than or equal to `threshold`.
`correction.for.chance`	`character(1)` How should the expected value of the stability score (see Details) be assessed? Options are "estimate", "exact" and "none". For "estimate", `N` random feature sets of the same sizes as the input feature sets (`features`) are generated. For "exact", all possible combinations of feature sets of the same sizes as the input feature sets are used. The exact computation is only feasible for very small numbers of features and of considered datasets (`length(features)`). For "none", the transformation `(score - expected) / (maximum - expected)` is not conducted, i.e. only `score` is used. This is not recommended.
`N`	`numeric(1)` Number of random feature sets to consider. Only relevant if `correction.for.chance` is set to "estimate".
`impute.na`	`numeric(1)` In some scenarios, the stability cannot be assessed based on all feature sets. E.g. if some of the feature sets are empty, the respective pairwise comparisons yield NA as result. With which value should these missing values be imputed? `NULL` means no imputation.
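
To illustrate how `threshold` acts on `sim.mat`, the following base R sketch rebuilds the similarity matrix from the Examples section and marks which pairs of features count as similar. The name `is.similar` is introduced here for illustration only and is not part of the package.

```r
# Similarity matrix from the Examples section: similarity decays with index distance
sim.mat <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
threshold <- 0.9

# TRUE where two features count as similar (entry >= threshold)
is.similar <- sim.mat >= threshold

# Features similar to feature 5 (the feature itself is always included)
which(is.similar[5, ])
```

With these values only direct neighbours pass the threshold, since 0.92 >= 0.9 but 0.92^2 = 0.8464 < 0.9.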

Details

Let O_{ij} denote the number of features in V_i that are not shared with V_j but that have a highly similar feature in V_j:

O_{ij} = |\{ x \in (V_i \setminus V_j) : \exists y \in (V_j \setminus V_i) \text{ with } Similarity(x,y) \geq threshold \}|.
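
As a minimal sketch of this count, the hypothetical helper `count.similar` below (not a function of the package) computes O_{ij} for two feature sets given by index. It assumes that the indices refer to the rows and columns of `sim.mat`; for character features, `sim.mat` would need matching dimnames.

```r
# Hypothetical helper: number of features exclusive to v.i that have
# a similar feature among the features exclusive to v.j
count.similar <- function(v.i, v.j, sim.mat, threshold = 0.9) {
  only.i <- setdiff(v.i, v.j)
  only.j <- setdiff(v.j, v.i)
  if (length(only.i) == 0 || length(only.j) == 0) return(0)
  sub <- sim.mat[only.i, only.j, drop = FALSE]
  # One row per feature in only.i: does any feature in only.j pass the threshold?
  sum(apply(sub >= threshold, 1, any))
}

sim.mat <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
count.similar(c(2, 4, 5), c(3, 5), sim.mat)  # O_ij: features 2 and 4 are both similar to 3
count.similar(c(3, 5), c(2, 4, 5), sim.mat)  # O_ji: feature 3 is similar to 2 and 4
```

The two calls return 2 and 1 respectively, which shows that O_{ij} and O_{ji} need not be equal.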

Then the stability measure is defined as (see Notation)

\frac{2}{m(m-1)}\sum_{i=1}^{m-1} \sum_{j=i+1}^{m} \frac{I(V_i, V_j) - E(I(V_i, V_j))}{\frac{|V_i| + |V_j|}{2} - E(I(V_i, V_j))}

with

I(V_i, V_j) = |V_i \cap V_j| + \frac{O_{ij} + O_{ji}}{2}.
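
The adjusted intersection I(V_i, V_j) can be sketched in base R as follows. The helper name `adjusted.intersection` is hypothetical, and the sketch assumes features are given by index into `sim.mat`.

```r
# Hypothetical helper implementing I(V_i, V_j) = |V_i n V_j| + (O_ij + O_ji) / 2
adjusted.intersection <- function(v.i, v.j, sim.mat, threshold = 0.9) {
  # Features exclusive to `a` that have a similar feature exclusive to `b`
  count.similar <- function(a, b) {
    only.a <- setdiff(a, b)
    only.b <- setdiff(b, a)
    if (length(only.a) == 0 || length(only.b) == 0) return(0)
    sum(apply(sim.mat[only.a, only.b, drop = FALSE] >= threshold, 1, any))
  }
  length(intersect(v.i, v.j)) +
    (count.similar(v.i, v.j) + count.similar(v.j, v.i)) / 2
}

sim.mat <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
adjusted.intersection(c(2, 4, 5), c(3, 5), sim.mat)
```

Here the sets share one feature, O_{ij} = 2 and O_{ji} = 1, so the result is 1 + 3/2 = 2.5, which equals the upper bound (|V_i| + |V_j|) / 2 used in the denominator of the stability measure.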

Note that this definition differs slightly from the original one in order to make it suitable for arbitrary datasets and similarity measures and applicable in situations with |V_i| \neq |V_j|.
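
Putting the pieces together, the following sketch evaluates the stability measure with a Monte Carlo estimate of E(I(V_i, V_j)). It is an illustration under stated assumptions, not the package's implementation: the function `yu.sketch` is hypothetical, random feature sets are assumed to be drawn uniformly without replacement from all features, features must be given by index, and empty feature sets are not handled.

```r
yu.sketch <- function(features, sim.mat, threshold = 0.9, N = 1000) {
  p <- nrow(sim.mat)
  similar <- sim.mat >= threshold

  # O_ab: features exclusive to `a` with a similar feature exclusive to `b`
  o <- function(a, b) {
    only.a <- setdiff(a, b)
    only.b <- setdiff(b, a)
    if (length(only.a) == 0 || length(only.b) == 0) return(0)
    sum(rowSums(similar[only.a, only.b, drop = FALSE]) > 0)
  }
  adj.int <- function(a, b) length(intersect(a, b)) + (o(a, b) + o(b, a)) / 2

  # One column per pair (i, j) with i < j
  pairs <- combn(length(features), 2)
  terms <- apply(pairs, 2, function(ij) {
    v.i <- features[[ij[1]]]
    v.j <- features[[ij[2]]]
    # Assumption: expectation estimated from N pairs of random sets of the same sizes
    expected <- mean(replicate(
      N, adj.int(sample(p, length(v.i)), sample(p, length(v.j)))
    ))
    maximum <- (length(v.i) + length(v.j)) / 2
    (adj.int(v.i, v.j) - expected) / (maximum - expected)
  })
  # Averaging over the m(m-1)/2 pairs equals the factor 2 / (m(m-1)) times the sum
  mean(terms)
}

set.seed(1)
feats <- list(1:3, 1:4, 1:5)
sim.mat <- 0.92 ^ abs(outer(1:10, 1:10, "-"))
yu.sketch(feats, sim.mat, N = 1000)
```

Because the expectation is estimated from random draws, the value changes slightly between runs unless a seed is set, and it need not match the output of `stabilityYu` exactly.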

Value

numeric(1) Stability value.

Notation

For the definition of all stability measures in this package, the following notation is used: Let V_1, \ldots, V_m denote the sets of chosen features for the m datasets, i.e. features has length m and V_i is the set of features given in the i-th entry of features. Furthermore, let h_j denote the number of sets that contain feature X_j so that h_j is the absolute frequency with which feature X_j is chosen. Analogously, let h_{ij} denote the number of sets that include both X_i and X_j. Also, let q = \sum_{j=1}^p h_j = \sum_{i=1}^m |V_i| and V = \bigcup_{i=1}^m V_i, where p denotes the total number of features.
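
The quantities in this notation can be computed directly in base R. The sketch below uses the feature sets from the Examples section and assumes p = 10 features in total; the variable names `h`, `h.pair` and `membership` are chosen for illustration.

```r
features <- list(1:3, 1:4, 1:5)
p <- 10                  # assumed total number of features
m <- length(features)    # number of datasets

# h_j: number of sets that contain feature X_j
h <- tabulate(unlist(features), nbins = p)

# p x m indicator matrix: entry (j, i) is 1 if feature X_j is in V_i
membership <- sapply(features, function(v) as.numeric(seq_len(p) %in% v))

# h_ij: number of sets that contain both X_i and X_j
h.pair <- tcrossprod(membership)

q <- sum(h)                    # equals sum(lengths(features))
V <- Reduce(union, features)   # union of all chosen features
```

In this example `h` starts with 3, 3, 3, 2, 1, `q` is 12 and `V` contains the features 1 to 5.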

References

Yu L, Han Y, Berens ME (2012). “Stable Gene Selection from Microarray Data via Sample Weighting.” IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(1), 262–272. doi:10.1109/tcbb.2011.47.

Zhang M, Zhang L, Zou J, Yao C, Xiao H, Liu Q, Wang J, Wang D, Wang C, Guo Z (2009). “Evaluating reproducibility of differential expression discoveries in microarray studies by considering correlated molecular changes.” Bioinformatics, 25(13), 1662–1668. doi:10.1093/bioinformatics/btp295.

Bommert A (2020). Integration of Feature Selection Stability in Model Fitting. Ph.D. thesis, TU Dortmund University, Germany. doi:10.17877/DE290R-21906.

Examples

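# Three feature sets given by index and a 10 x 10 similarity matrix
feats = list(1:3, 1:4, 1:5)
mat = 0.92 ^ abs(outer(1:10, 1:10, "-"))
# The chance correction is estimated from N = 1000 random feature sets
stabilityYu(features = feats, sim.mat = mat, N = 1000)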

[Package stabm version 1.2.2 Index]