LASSO2_XGBtraining {csmpv}    R Documentation

Variable Selection with LASSO2 and Modeling with XGBoost

Description

This function performs a two-step process: variable selection using LASSO2 and building a predictive model using XGBoost.

Usage

LASSO2_XGBtraining(
  data = NULL,
  standardization = FALSE,
  columnWise = TRUE,
  biomks = NULL,
  outcomeType = c("binary", "continuous", "time-to-event"),
  Y = NULL,
  time = NULL,
  event = NULL,
  nfolds = 10,
  nrounds = 5,
  nthread = 2,
  gamma = 1,
  max_depth = 3,
  eta = 0.3,
  outfile = "nameWithPath"
)

Arguments

data

A data matrix or data frame containing samples in rows and features/traits in columns.

standardization

A logical value indicating whether the data should be standardized before variable selection. Default is FALSE.

columnWise

A logical value indicating whether standardization is applied column-wise (TRUE, the default) or row-wise (FALSE). This parameter is only used when standardization is TRUE.

biomks

A vector of potential biomarkers for variable selection. These should be a subset of the column names in the data parameter.

outcomeType

The type of the outcome variable: "binary" (default), "continuous", or "time-to-event".

Y

The name of the outcome variable when the outcome type is either "binary" or "continuous".

time

The name of the time variable when the outcome type is "time-to-event".

event

The name of the event variable when the outcome type is "time-to-event".

nfolds

The number of folds for cross-validation. The default is 10.

nrounds

The maximum number of boosting iterations for the XGBoost model. The default is 5.

nthread

The number of parallel threads used for running XGBoost. The default is 2.

gamma

The minimum loss reduction required to make a further partition on a leaf node of the tree. The default is 1.

max_depth

The maximum depth of a tree in the XGBoost model. The default is 3.

eta

The learning rate for the XGBoost model. The default is 0.3.

outfile

A string for the output file, including the path if necessary, but without the file type extension.
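The difference between column-wise and row-wise standardization (the standardization and columnWise arguments) can be illustrated with base R. This is only a sketch: it assumes standardization means z-scoring (subtract the mean, divide by the standard deviation), which this page does not state, and the matrix x is simulated for illustration.

```r
set.seed(1)
# Hypothetical data: 5 samples in rows, 4 features in columns
x <- matrix(rnorm(20, mean = 5, sd = 2), nrow = 5,
            dimnames = list(NULL, paste0("V", 1:4)))

# Column-wise: each feature is centered and scaled across samples
col_std <- scale(x)

# Row-wise: each sample is centered and scaled across its features
row_std <- t(scale(t(x)))

round(colMeans(col_std), 10)  # feature means after column-wise scaling
round(rowMeans(row_std), 10)  # sample means after row-wise scaling
```

Column-wise scaling puts features on a common scale, which matters for LASSO because the penalty acts on coefficient size; row-wise scaling instead normalizes each sample's profile.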

Details

The first part of LASSO2_XGBtraining is variable selection with LASSO2, typically based on the mean lambda.1se from 10 iterations of n-fold cross-validated LASSO regression. In each iteration, lambda.1se is the largest value of lambda for which the cross-validated error is within one standard error of the minimum. If that lambda leaves only one variable or none, the cross-validation results are ignored and lambda is instead chosen from full-data fits so that at least two variables remain.
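The selection step described above can be sketched with glmnet. This is an illustration under stated assumptions, not the package's internal code: the data are simulated, and the fallback (moving down the full-data lambda path until at least two variables have nonzero coefficients) is one plausible reading of the full-data step, not a confirmed implementation.

```r
library(glmnet)

set.seed(1)
# Hypothetical data: 100 samples, 6 candidate predictors, binary outcome
n <- 100
x <- matrix(rnorm(n * 6), nrow = n, dimnames = list(NULL, paste0("V", 1:6)))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))

# Repeat 10-fold cross-validation 10 times and average lambda.1se
lambdas <- replicate(10,
  cv.glmnet(x, y, family = "binomial", nfolds = 10)$lambda.1se)
mean_lambda <- mean(lambdas)

# Fit on the full data; keep variables with nonzero coefficients at mean_lambda
fit <- glmnet(x, y, family = "binomial")
beta <- as.matrix(coef(fit, s = mean_lambda))[-1, 1]  # drop the intercept
selected <- names(beta)[beta != 0]

# Assumed fallback: if fewer than two variables survive, take the largest
# lambda on the full-data path that keeps at least two variables
if (length(selected) < 2) {
  path <- as.matrix(fit$beta)  # variables in rows, decreasing lambda in columns
  idx <- which(colSums(path != 0) >= 2)[1]
  selected <- rownames(path)[path[, idx] != 0]
}
selected
```

Averaging lambda.1se over repeated cross-validation reduces the dependence of the selected variables on any single random fold assignment.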

The second part of LASSO2_XGBtraining discards the shrunken LASSO coefficients and builds an XGBoost model from the selected variables. It supports three types of outcomes: continuous, binary, and time-to-event.
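A minimal sketch of the modeling step for a binary outcome, using the xgboost package directly with this function's default tuning values. The data, the selected vector, and the choice of the "binary:logistic" objective are assumptions made for illustration; the objectives the package uses internally are not stated on this page.

```r
library(xgboost)

set.seed(1)
# Hypothetical data: 100 samples, 3 predictors, binary outcome
n <- 100
x <- matrix(rnorm(n * 3), nrow = n, dimnames = list(NULL, c("V1", "V2", "V3")))
y <- rbinom(n, 1, plogis(1.5 * x[, 1] - 1.5 * x[, 2]))

# Pretend the LASSO step kept these two variables (hypothetical)
selected <- c("V1", "V2")

# Only the selected columns enter the model; the LASSO coefficients are not used
dtrain <- xgb.DMatrix(data = x[, selected, drop = FALSE], label = y)
params <- list(objective = "binary:logistic",
               eta = 0.3, gamma = 1, max_depth = 3, nthread = 2)
model <- xgb.train(params = params, data = dtrain, nrounds = 5)

# Training-set scores: predicted probability of the positive class
scores <- predict(model, dtrain)
head(scores)
```

xgboost also provides objectives such as "reg:squarederror" for continuous outcomes and "survival:cox" for time-to-event outcomes, which would be the natural counterparts in a sketch like this one.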

Value

A list is returned:

XGBoost_model

An XGBoost model

XGBoost_model_score

Model scores for the given training data set. For a continuous outcome variable, this is a vector of the estimated continuous values; for a binary outcome variable, this is a vector of probabilities of the positive class; for a time-to-event outcome, this is a vector of risk scores.

Author(s)

Aixiang Jiang

References

Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization Paths for Generalized Linear Models via Coordinate Descent, Journal of Statistical Software, Vol. 33(1), 1-22, doi:10.18637/jss.v033.i01.

Simon, N., Friedman, J., Hastie, T. and Tibshirani, R. (2011) Regularization Paths for Cox's Proportional Hazards Model via Coordinate Descent, Journal of Statistical Software, Vol. 39(5), 1-13, doi:10.18637/jss.v039.i05.

Chen, T. and Guestrin, C. (2016) XGBoost: A Scalable Tree Boosting System, 22nd SIGKDD Conference on Knowledge Discovery and Data Mining, https://arxiv.org/abs/1603.02754.

Examples

# Load in data sets:
data("datlist", package = "csmpv")
tdat = datlist$training

# The function saves files locally. You can define your own temporary directory. 
# If not, tempdir() can be used to get the system's temporary directory.
temp_dir = tempdir()
# As an example, let's define Xvars, which will be used later:
Xvars = c("highIPI", "B.Symptoms", "MYC.IHC", "BCL2.IHC", "CD10.IHC", "BCL6.IHC")
# The function can work with three different outcome types. 
# Here, we use binary as an example:
blxfit = LASSO2_XGBtraining(data = tdat, biomks = Xvars, Y = "DZsig",
                           outfile = paste0(temp_dir, "/binary_LASSO2_XGBoost"))
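# You can save the files to any directory you choose.

# To remove the files written above, delete them by their shared prefix
# (assumed here to be the outfile name). unlink() does not delete a directory
# unless recursive = TRUE, and tempdir() itself is needed by the R session.
unlink(list.files(temp_dir, pattern = "^binary_LASSO2_XGBoost", full.names = TRUE))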

[Package csmpv version 1.0.3 Index]