BMsel {biospear} | R Documentation |
Biomarker selection in a Cox regression model
Description
This function fits a Cox regression model in a prognostic or a biomarker-by-treatment interaction setting, using a selection procedure to perform variable selection.
Usage
BMsel(data, x, y, z, tt, inter, std.x = TRUE, std.i = FALSE, std.tt = TRUE,
method = c('alassoL', 'alassoR', 'alassoU', 'enet', 'gboost',
'glasso', 'lasso', 'lasso-1se', 'lasso-AIC', 'lasso-BIC',
'lasso-HQIC', 'lasso-pct', 'lasso-pcvl','lasso-RIC', 'modCov',
'PCAlasso', 'PLSlasso', 'ridge', 'ridgelasso', 'stabSel', 'uniFDR'),
folds = 5, uni.fdr = 0.05, uni.test = 1, ss.rando = F, ss.nsub = 100,
ss.fsub = 0.5, ss.fwer = 1, ss.thr = 0.6, dfmax = ncol(data) + 1,
pct.rep = 1, pct.qtl = 0.95, showWarn = TRUE, trace = TRUE)
## S3 method for class 'resBMsel'
summary(object, show = TRUE, keep = c('tt', 'z', 'x', 'xt'),
add.ridge = FALSE, ...)
Arguments
data |
input |
x |
colnames or position of the biomarkers in |
y |
colnames or position of the survival outcome in |
z |
colnames or position of the clinical covariates in |
tt |
colname or position of the treatment in |
inter |
logical parameter indicating if biomarker-by-treatment interactions should be computed. |
std.x |
logical parameter indicating if the biomarkers should be standardized (i.e. subtracting the mean and dividing by the standard deviation of each biomarker). |
std.i |
logical parameter indicating if the biomarker-by-treatment interactions should be standardized (i.e. subtracting the mean and dividing by the standard deviation of each interaction). |
std.tt |
logical parameter indicating if the treatment should be recoded as +/-0.5. |
method |
methods computed to perform variable selection and to estimate the regression coefficients. See the Details section to understand all the implemented methods. |
folds |
number of folds. |
uni.fdr , uni.test |
specific parameters for the univariate procedure. |
ss.fsub , ss.fwer , ss.nsub , ss.rando , ss.thr |
specific parameters for the stability selection. |
dfmax |
limits the maximum number of variables in the model. Useful with a very large number of covariates to limit computation time. |
pct.rep , pct.qtl |
specific parameters for the percentile lasso. |
showWarn |
logical parameter indicating if warnings should be printed. |
trace |
logical parameter indicating if messages should be printed. |
object |
object of class ' |
show |
parameter for the |
keep |
parameter for the |
add.ridge |
parameter for the |
... |
Details
The objects x, y, z (if any) and tt (if any) are mandatory for non-simulated data sets.
The method parameter specifies the approaches for model selection. Most of these selection methods are based on the lasso penalty (Tibshirani, 1996). The tuning parameter is usually chosen through the cross-validated log-likelihood criterion (cvl), except for the empirical extensions of the lasso, in which additional penalties to the cvl (indicated by a suffix, e.g. lasso-pcvl) are used to estimate the tuning parameter. Other methods based on the lasso are also implemented, such as the adaptive lasso (alassoL, alassoR and alassoU, for which the last letter indicates the procedure used to estimate the preliminary weights: "L" for lasso, "R" for ridge and "U" for univariate), the elastic-net (enet) and stability selection (stabSel). For the interaction setting, specific methods are implemented: to reduce or control the main effects matrix (i.e. ridge (ridgelasso) or dimension reduction (PCAlasso or PLSlasso)), to select or discard main effects and interactions simultaneously (i.e. group-lasso (glasso)), or to include only the interaction part in the model (i.e. modCov). Some selection methods not based on penalized regression are also proposed: univariate selection (uniFDR) and gradient boosting (gboost). The ridge penalty without selection can also be applied.
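As a rough illustration of what the interaction setting operates on, the sketch below builds a design matrix from the options described under std.x, std.tt and inter, using base R only. It is not the package's internal code: the 0/1 treatment coding, the use of simple products for the interactions and the ":tt" column suffix are assumptions made for this example.

```r
set.seed(7)
n <- 8; p <- 3
# Hypothetical biomarker matrix, named in the style used by the examples
x <- matrix(rnorm(n * p, mean = 5, sd = 2), n, p,
            dimnames = list(NULL, sprintf("bm%03d", 1:p)))
tt <- rep(c(0, 1), each = n / 2)   # treatment arm, assumed coded 0/1

x_std  <- scale(x)                 # centre and scale each biomarker (cf. std.x)
tt_rec <- tt - 0.5                 # recode treatment as -0.5 / +0.5 (cf. std.tt)
xt     <- x_std * tt_rec           # row-wise products: biomarker-by-treatment terms
colnames(xt) <- paste0(colnames(x), ":tt")  # naming chosen for this sketch only

# Treatment, main effects and interactions side by side
design <- cbind(tt = tt_rec, x_std, xt)
dim(design)
```

With p biomarkers the matrix has 1 + 2p columns, which is why methods such as ridgelasso or glasso treat the main-effect and interaction blocks differently.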
For all methods except uniFDR, tuning parameters are chosen by maximizing the cross-validated log-likelihood (max-cvl). For the elastic-net, the "alpha" parameter (trade-off between ridge and lasso) is searched over a predefined grid of values (as suggested by the authors, Zou et al. 2005), and "lambda" is estimated by maximizing the above-mentioned cvl criterion for each value of "alpha". The combination (alpha; lambda) that maximizes the cvl is finally retained. For gradient boosting, the number of steps is also estimated by max-cvl. For the univariate selection, the tuning parameter is the FDR threshold defined by the user to control for multiple testing (using the parameter uni.fdr).
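The max-cvl idea can be sketched in base R as follows. This is only an illustration under stated assumptions: the cvl is taken in the form "full-data log partial likelihood minus training-fold log partial likelihood, summed over folds", a ridge penalty stands in for the lasso because it can be optimized with optim(), and event times are assumed untied. The helper names cox_loglik and fit_ridge are hypothetical and do not come from the package.

```r
# Cox log partial likelihood at a given beta (assumes no tied event times)
cox_loglik <- function(beta, time, status, X) {
  ord <- order(time, decreasing = TRUE)
  eta <- drop(X[ord, , drop = FALSE] %*% beta)
  # cumsum over decreasing times gives the risk-set sum for each subject
  sum(status[ord] * (eta - log(cumsum(exp(eta)))))
}

# Ridge-penalized Cox fit (stand-in for the penalized fits discussed above)
fit_ridge <- function(lambda, time, status, X) {
  obj <- function(b) -cox_loglik(b, time, status, X) + lambda * sum(b^2)
  optim(rep(0, ncol(X)), obj, method = "BFGS")$par
}

set.seed(1)
n <- 200; p <- 5
X <- matrix(rnorm(n * p), n, p)
time   <- rexp(n, rate = exp(0.8 * X[, 1]))  # only the first column is active
cens   <- rexp(n, rate = 0.3)
status <- as.numeric(time <= cens)
time   <- pmin(time, cens)

folds   <- sample(rep(1:5, length.out = n))
lambdas <- c(0.01, 0.1, 1, 10, 100)          # candidate tuning parameters

# For each lambda: fit without fold k, then add up the left-out contributions
cvl <- sapply(lambdas, function(lam) {
  sum(sapply(1:5, function(k) {
    tr <- folds != k
    b  <- fit_ridge(lam, time[tr], status[tr], X[tr, , drop = FALSE])
    cox_loglik(b, time, status, X) -
      cox_loglik(b, time[tr], status[tr], X[tr, , drop = FALSE])
  }))
})

best_lambda <- lambdas[which.max(cvl)]       # max-cvl choice
```

For the elastic-net, the same loop would be repeated for each value on the "alpha" grid and the pair with the largest cvl kept.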
We have included the possibility to adjust for clinical covariates (z) for all methods. For penalized regressions, these covariates are left unpenalized. For gradient boosting, a model with the clinical covariates is fitted first and its regression coefficients are fixed as an offset in the boosting approach. For the univariate selection, clinical covariates are forced into the model as adjustment variables, and the FDR is calculated on the Wald p-values of the coefficient associated with the biomarker in such models.
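The adjusted univariate procedure described above can be approximated as in the following sketch, which uses coxph() from the survival package. It assumes the Benjamini-Hochberg correction for the FDR (the text does not say which procedure is used) and a single simulated clinical covariate; all data and object names are made up for the example.

```r
library(survival)

set.seed(42)
n <- 300; p <- 10
x <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, sprintf("bm%03d", 1:p)))
z <- rnorm(n)                                # one clinical covariate
time   <- rexp(n, rate = exp(0.9 * x[, 1] + 0.5 * z))  # only bm001 is prognostic
cens   <- rexp(n, rate = 0.2)
status <- as.numeric(time <= cens)
time   <- pmin(time, cens)

# One Cox model per biomarker, each adjusted for z; keep the biomarker's Wald p-value
pvals <- apply(x, 2, function(xj) {
  fit <- coxph(Surv(time, status) ~ z + xj)
  summary(fit)$coefficients["xj", "Pr(>|z|)"]
})

# FDR control across biomarkers (Benjamini-Hochberg is an assumption here)
qvals    <- p.adjust(pvals, method = "BH")
selected <- names(qvals)[qvals <= 0.05]      # threshold playing the role of uni.fdr
selected
```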
Value
An object of class 'resBMsel' containing the list of the selected biomarkers and their estimated regression coefficients for the chosen methods.
Author(s)
Nils Ternes, Federico Rotolo, and Stefan Michiels
Maintainer: Nils Ternes nils.ternes@yahoo.com
References
Ternes N, Rotolo F and Michiels S.
Empirical extensions of the lasso penalty to reduce
the false discovery rate in high-dimensional Cox regression models.
Statistics in Medicine 2016;35(15):2561-2573.
doi:10.1002/sim.6927
Ternes N, Rotolo F, Heinze G and Michiels S.
Identification of biomarker-by-treatment interactions in randomized
clinical trials with survival outcomes and high-dimensional spaces.
Biometrical Journal. In press.
doi:10.1002/bimj.201500234
Tibshirani R.
Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society, Ser B 1996;58:267-288.
Examples
########################################
# Simulated data set
########################################
## Low calculation time
set.seed(654321)
sdata <- simdata(
n = 500, p = 20, q.main = 3, q.inter = 0,
prob.tt = 0.5, alpha.tt = 0,
beta.main = -0.8,
b.corr = 0.6, b.corr.by = 4,
m0 = 5, wei.shape = 1, recr = 4, fu = 2,
timefactor = 1)
resBM <- BMsel(
data = sdata,
method = c("lasso", "lasso-pcvl"),
inter = FALSE,
folds = 5)
summary(resBM)
## Not run:
## Moderate calculation time
set.seed(123456)
sdata <- simdata(
n = 500, p = 100, q.main = 5, q.inter = 5,
prob.tt = 0.5, alpha.tt = -0.5,
beta.main = c(-0.5, -0.2), beta.inter = c(-0.7, -0.4),
b.corr = 0.6, b.corr.by = 10,
m0 = 5, wei.shape = 1, recr = 4, fu = 2,
timefactor = 1,
active.inter = c("bm003", "bm021", "bm044", "bm049", "bm097"))
resBM <- BMsel(
data = sdata,
method = c("lasso", "lasso-pcvl"),
inter = TRUE,
folds = 5)
summary(resBM)
summary(resBM, keep = "xt")
## End(Not run)
########################################
# Breast cancer data set
########################################
## Not run:
data(Breast)
dim(Breast)
set.seed(123456)
resBM <- BMsel(
data = Breast,
x = 4:ncol(Breast),
y = 2:1,
tt = 3,
inter = FALSE,
std.x = TRUE,
folds = 5,
method = c("lasso", "lasso-pcvl"))
summary(resBM)
## End(Not run)
########################################
########################################