LASSO2_XGBtraining {csmpv} | R Documentation |
Variable Selection with LASSO2 and Modeling with XGBoost
Description
This function performs a two-step process: variable selection using LASSO2 and building a predictive model using XGBoost.
Usage
LASSO2_XGBtraining(
data = NULL,
standardization = FALSE,
columnWise = TRUE,
biomks = NULL,
outcomeType = c("binary", "continuous", "time-to-event"),
Y = NULL,
time = NULL,
event = NULL,
nfolds = 10,
nrounds = 5,
nthread = 2,
gamma = 1,
max_depth = 3,
eta = 0.3,
outfile = "nameWithPath"
)
Arguments
data |
A data matrix or data frame containing samples in rows and features/traits in columns. |
standardization |
A logical value indicating if standardization is needed before variable selection. Default is FALSE. |
columnWise |
A logical value indicating whether standardization is applied column-wise (TRUE) or row-wise (FALSE). Default is TRUE. This parameter is only meaningful when standardization is TRUE. |
biomks |
A vector of potential biomarkers for variable selection. These should be a subset of the column names in the data parameter. |
outcomeType |
The type of the outcome variable: "binary" (default), "continuous", or "time-to-event". |
Y |
The name of the outcome variable when the outcome type is either "binary" or "continuous". |
time |
The name of the time variable when the outcome type is "time-to-event". |
event |
The name of the event variable when the outcome type is "time-to-event". |
nfolds |
The number of folds for cross-validation. The default is 10. |
nrounds |
The maximum number of boosting iterations for the XGBoost model. The default is 5. |
nthread |
The number of parallel threads used for running XGBoost. The default is 2. |
gamma |
The minimum loss reduction required to make a further partition on a leaf node of the tree. The default is 1. |
max_depth |
The maximum depth of a tree in the XGBoost model. The default is 3. |
eta |
The learning rate for the XGBoost model. The default is 0.3. |
outfile |
A string for the output file, including the path if necessary, but without the file type extension. |
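The difference between column-wise and row-wise standardization (the columnWise argument) can be illustrated with base R. This is a minimal sketch that assumes standardization means z-scoring (mean 0, standard deviation 1); the toy matrix is hypothetical and the code is not taken from csmpv itself.

```r
# Toy matrix (hypothetical values): 3 samples in rows, 3 features in columns
x <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 10), nrow = 3,
            dimnames = list(paste0("s", 1:3), c("a", "b", "c")))

# Column-wise: each feature is centered and scaled across samples
col_std <- scale(x)

# Row-wise: each sample is centered and scaled across its features
# (assumption: z-scoring; csmpv may use a different transformation)
row_std <- t(scale(t(x)))

round(colMeans(col_std), 10)  # column means are 0
round(rowMeans(row_std), 10)  # row means are 0
```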
Details
The first part of LASSO2_XGBtraining performs variable selection with LASSO2, typically based on the mean lambda.1se from 10 iterations of n-fold cross-validated LASSO regression. In each iteration, lambda.1se is the largest value of lambda for which the cross-validation error is within 1 standard error of the minimum. However, if this selects one or no variables, the cross-validation results are ignored, and at least two variables are retained through full-data lambda simulations.
The second part of LASSO2_XGBtraining discards the shrunken LASSO coefficients and builds an XGBoost model on the selected variables. It is suitable for three types of outcomes: continuous, binary, and time-to-event.
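The lambda.1se rule described above can be worked through by hand in base R. The sketch below uses hypothetical cross-validation summaries (the numbers are illustrative, not csmpv output) and shows how a single iteration's lambda.1se would be picked and how several iterations would be averaged.

```r
# Hypothetical CV summary for one iteration
lambda <- c(0.50, 0.30, 0.20, 0.10, 0.05)  # candidate penalties
cvm    <- c(1.40, 1.10, 1.02, 1.00, 1.05)  # mean CV error at each lambda
cvsd   <- c(0.06, 0.05, 0.05, 0.04, 0.05)  # standard error of each mean

i_min      <- which.min(cvm)               # position of the minimum error
threshold  <- cvm[i_min] + cvsd[i_min]     # minimum error plus 1 standard error
lambda_1se <- max(lambda[cvm <= threshold])  # largest lambda within the threshold
lambda_1se
# 0.2

# Averaging over iterations (hypothetical lambda.1se values from repeated CV)
mean_lambda <- mean(c(lambda_1se, 0.30, 0.20))
```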
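The two steps can be sketched directly with the glmnet and xgboost packages. This is a minimal illustration on simulated data for a binary outcome, not csmpv's internal code: the data, variable names, and the way the averaged lambda is applied are assumptions, and the package's fallback for fewer than two selected variables is not reproduced.

```r
library(glmnet)
library(xgboost)

set.seed(1)
# Simulated data (hypothetical): 200 samples, 6 features, binary outcome
n <- 200
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("x", 1:6)))
y <- rbinom(n, 1, plogis(2 * x[, "x1"] - 2 * x[, "x2"]))

# Step 1: repeat 10-fold cross-validated LASSO 10 times, average lambda.1se
lambda_1se <- replicate(
  10, cv.glmnet(x, y, family = "binomial", nfolds = 10)$lambda.1se
)
mean_lambda <- mean(lambda_1se)

# Keep variables with non-zero coefficients at the averaged lambda
fit <- glmnet(x, y, family = "binomial")
beta <- as.matrix(coef(fit, s = mean_lambda))[-1, 1]  # drop the intercept
selected <- names(beta)[beta != 0]

# The sketch simply stops if fewer than two variables survive
stopifnot(length(selected) >= 2)

# Step 2: ignore the LASSO coefficients; fit XGBoost on the selected columns
dtrain <- xgb.DMatrix(data = x[, selected, drop = FALSE], label = y)
params <- list(objective = "binary:logistic", eta = 0.3, gamma = 1,
               max_depth = 3, nthread = 2)
model <- xgb.train(params = params, data = dtrain, nrounds = 5)

# Predicted probability of the positive class for the training samples
score <- predict(model, dtrain)
```

The hyperparameter values mirror the defaults listed under Usage. For continuous or time-to-event outcomes, a comparable sketch would presumably switch the glmnet family (for example "gaussian" or "cox") and the XGBoost objective (for example "reg:squarederror" or "survival:cox"); which settings csmpv actually uses is not stated here.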
Value
A list is returned:
XGBoost_model |
An XGBoost model |
XGBoost_model_score |
Model scores for the given training data set. For a continuous outcome variable, this is a vector of the estimated continuous values; for a binary outcome variable, this is a vector representing the probability of the positive class; for a time-to-event outcome, this is a vector of risk scores. |
Author(s)
Aixiang Jiang
References
Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22, doi:10.18637/jss.v033.i01.
Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2011) Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent, Journal of Statistical Software, Vol. 39(5), 1-13, doi:10.18637/jss.v039.i05.
Chen, T. and Guestrin, C. (2016) XGBoost: A Scalable Tree Boosting System, 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, https://arxiv.org/abs/1603.02754
Examples
# Load in data sets:
data("datlist", package = "csmpv")
tdat = datlist$training
# The function saves files locally. You can define your own temporary directory.
# If not, tempdir() can be used to get the system's temporary directory.
temp_dir = tempdir()
# As an example, let's define Xvars, which will be used later:
Xvars = c("highIPI", "B.Symptoms", "MYC.IHC", "BCL2.IHC", "CD10.IHC", "BCL6.IHC")
# The function can work with three different outcome types.
# Here, we use binary as an example:
blxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
outfile = paste0(temp_dir, "/binary_LASSO2_XGBoost"))
# You can save the files to any directory you like.
# To delete "temp_dir" and its contents, use the following
# (unlink() only removes a directory when recursive = TRUE):
unlink(temp_dir, recursive = TRUE)