R: Univariate feature selection based on univariate significance...

uni.selection {compound.Cox}

R Documentation

Univariate feature selection based on univariate significance tests

Description

This function performs univariate feature selection using significance tests (Wald tests or score tests) based on association between individual features and survival. Features are selected if their P-values are less than a given threshold (P.value).

Usage

uni.selection(t.vec, d.vec, X.mat, P.value=0.001,K=10,score=TRUE,d0=0,
                       randomize=FALSE,CC.plot=FALSE,permutation=FALSE,M=200)

Arguments

`t.vec`	Vector of survival times (time to either death or censoring)
`d.vec`	Vector of censoring indicators (1=death, 0=censoring)
`X.mat`	n by p matrix of covariates, where n is the sample size and p is the number of covariates
`P.value`	A threshold for selecting features
`K`	The number of cross-validation folds
`score`	If TRUE, the score tests are used; if not, the Wald tests are used
`d0`	A positive constant to stabilize the variance of score statistics (Witten & Tibshirani 2010)
`randomize`	If TRUE, randomize patient ID's before cross-validation
`CC.plot`	If TRUE, the compound covariate (CC) predictors are plotted
`permutation`	If TRUE, the FDR is computed by a permutation method (Witten & Tibshirani 2010; Emura et al. 2019).
`M`	The number of permutations to calculate the FDR

Details

The cross-validated likelihood (CVL) value is computed for selected features (Matsui 2006; Emura et al. 2019). A high CVL value corresponds to a better predictive ability of selected features. Hence, the CVL value can be used to find the optimal set of features. The CVL value is computed by a K-fold cross-validation, where the number K can be chosen by user. The false discovery rate (FDR) is also computed by a formula and a permutation test (if "permutation=TRUE"). The RCVL1 and RCVL2 are "re-substitution" CVL values and provide upper control limits for the CVL value. If the CVL value is less than RCVL1 and RCVL2 values, the CVL value would be in-control. On the other hand, if the CVL value exceeds either RCVL1 or RCVL2 value, then the CVL may be computed again after changing the sample allocation.

Value

`gene`	Gene symbols
`beta`	Estimated regression coefficients
`Z`	Z-values for significance tests
`P`	P-values for significance tests
`CVL`	The value of CVL, RCVL1, and RCVL2 (Emura et al. 2019)
`Genes`	The number of genes, the number of selected genes, and the number of falsely selected genes
`FDR`	False discovery rate (by a formula or a permutation method)

Author(s)

Takeshi Emura

References

Emura T, Matsui S, Chen HY (2019). compound.Cox: Univariate Feature Selection and Compound Covariate for Predicting Survival, Computer Methods and Programs in Biomedicine 168: 21-37.

Matsui S (2006). Predicting Survival Outcomes Using Subsets of Significant Genes in Prognostic Marker Studies with Microarrays. BMC Bioinformatics: 7:156.

Witten DM, Tibshirani R (2010) Survival analysis with high-dimensional covariates. Stat Method Med Res 19:29-51

Examples

data(Lung)
t.vec=Lung$t.vec[Lung$train==TRUE]
d.vec=Lung$d.vec[Lung$train==TRUE]
X.mat=Lung[Lung$train==TRUE,-c(1,2,3)]
uni.selection(t.vec, d.vec, X.mat, P.value=0.05,K=5,score=FALSE)
## the outputs reproduce Table 3 of Emura and Chen (2016) ##

[Package compound.Cox version 3.30 Index]