rboost {ranktreeEnsemble}	R Documentation

Generalized Boosted Modeling via Rank-Based Trees for Single Sample Classification with Gene Expression Profiles

Description

The function fits generalized boosted models via rank-based trees for both binary and multi-class classification problems. It converts continuous gene expression profiles into ranked gene pairs, for which variable importance indices are computed and used for dimension reduction. The boosting implementation is imported directly from the gbm package. For technical details, see the vignette: utils::browseVignettes("gbm").

Usage

rboost(
  formula,
  data,
  dimreduce = TRUE,
  datrank = TRUE,
  distribution = "multinomial",
  weights,
  ntree = 100,
  nodedepth = 3,
  nodesize = 5,
  shrinkage = 0.05,
  bag.fraction = 0.5,
  train.fraction = 1,
  cv.folds = 5,
  keep.data = TRUE,
  verbose = TRUE,
  class.stratify.cv = TRUE,
  n.cores = NULL
)

Arguments

formula

Object of class 'formula' describing the model to fit.

data

Data frame containing the y-outcome and x-variables.

dimreduce

Dimension reduction via variable importance weighted forests. FALSE means no dimension reduction; TRUE means removing 75% of the variables before binary rank conversion and then fitting a weighted forest; a numeric value x between 0 and 1 means removing the proportion x of the variables before binary rank conversion and then fitting a weighted forest.
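For illustration, the forms of dimreduce described above could be supplied as below (a sketch, assuming the tnbc data set shipped with the package and the column selection used in the Examples section):

data(tnbc)
## keep all variables: no dimension reduction
obj.full <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)], dimreduce = FALSE)
## remove 50% of the variables before binary rank conversion
obj.half <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)], dimreduce = 0.5)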

datrank

Logical indicating whether ranked raw data are used to fit the dimension reduction model.

distribution

Character string specifying the name of the distribution to use. If the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed.

weights

An optional vector of weights to be used in the fitting process. The weights must be positive but need not be normalized.

ntree

Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion; it matches n.trees in the gbm package.

nodedepth

Integer specifying the maximum depth of each tree. A value of 1 implies an additive model. This matches interaction.depth in the gbm package.

nodesize

Integer specifying the minimum number of observations in the terminal nodes of the trees; it matches n.minobsinnode in the gbm package. Note that this is the actual number of observations, not the total weight.

shrinkage

A shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.05.

bag.fraction

The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1, running the same model twice will result in similar but different fits. gbm uses the R random number generator, so set.seed can ensure that the model can be reconstructed. Preferably, the user should save the returned gbm.object using save. Default is 0.5.
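Because bag.fraction < 1 makes the fit stochastic, a reproducible model can be obtained by setting the seed before each call (a sketch; the data and column selection follow the Examples section):

data(tnbc)
set.seed(2023)
obj.a <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
set.seed(2023)
obj.b <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
## both calls see identical bootstrap draws, so the two fits agree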

train.fraction

The first train.fraction * nrow(data) observations are used to fit the gbm and the remaining observations are used for computing out-of-sample estimates of the loss function.

cv.folds

Number of cross-validation folds to perform. If cv.folds > 1, then gbm, in addition to the usual fit, will perform cross-validation and calculate an estimate of the generalization error, returned in cv.error.

keep.data

Logical indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster, at the cost of storing an extra copy of the dataset.

verbose

Logical indicating whether to print out progress and performance indicators. If this option is left unspecified for gbm.more, it uses the verbose setting from object. Default is TRUE.

class.stratify.cv

Logical indicating whether or not the cross-validation should be stratified by class. The purpose of stratifying the cross-validation is to help avoid situations in which training sets do not contain all classes.

n.cores

The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified by the user, it is guessed using the detectCores function in the parallel package. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.

Value

fit

A vector containing the fitted values on the scale of the regression function (e.g., the log-odds scale for bernoulli).

train.error

A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data.

valid.error

A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data.

cv.error

If cv.folds < 2, this component is NULL. Otherwise, it is a vector of length equal to the number of fitted trees containing a cross-validated estimate of the loss function for each boosting iteration.

oobag.improve

A vector of length equal to the number of fitted trees containing an out-of-bag estimate of the marginal reduction in the expected value of the loss function. The out-of-bag estimate uses only the training data and is useful for estimating the optimal number of boosting iterations. See gbm.perf.

cv.fitted

If cross-validation was performed, the cross-validated predicted values on the scale of the linear predictor; that is, the fitted values for the i-th CV fold from the model trained on the data in all other folds.
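As one way of using these components, the boosting iteration that minimizes the cross-validated loss can be read off cv.error (a sketch, assuming the components are accessible by name on the object fitted in the Examples section and that cv.folds > 1):

best.iter <- which.min(obj$cv.error)   ## iteration with the lowest CV loss
plot(obj$train.error, type = "l")      ## training loss per iteration
lines(obj$cv.error, lty = 2)           ## overlay the cross-validated loss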

Author(s)

Ruijie Yin (Maintainer, <ruijieyin428@gmail.com>), Chen Ye and Min Lu

References

Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.

Examples

data(tnbc)
obj <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
obj
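A slightly fuller call toggling several of the arguments documented above (a sketch; the argument values are illustrative only):

obj2 <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)],
               ntree = 200, shrinkage = 0.01, cv.folds = 3,
               verbose = FALSE)
obj2$cv.error[1:5]   ## cross-validated loss over the first iterations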

[Package ranktreeEnsemble version 0.22 Index]