rboost {ranktreeEnsemble}	R Documentation

Generalized Boosted Modeling via Rank-Based Trees for Single Sample Classification with Gene Expression Profiles

Description

The function fits generalized boosted models via rank-based trees for both binary and multi-class classification problems. It converts continuous gene expression profiles into ranked gene pairs, for which variable importance indices are computed and used for dimension reduction. The boosting implementation is imported directly from the gbm package. For technical details, see the vignette: utils::browseVignettes("gbm").

Usage

rboost(
  formula,
  data,
  dimreduce = TRUE,
  datrank = TRUE,
  distribution = "multinomial",
  weights,
  ntree = 100,
  nodedepth = 3,
  nodesize = 5,
  shrinkage = 0.05,
  bag.fraction = 0.5,
  train.fraction = 1,
  cv.folds = 5,
  keep.data = TRUE,
  verbose = TRUE,
  class.stratify.cv = TRUE,
  n.cores = NULL
)

Arguments

formula

Object of class 'formula' describing the model to fit.

data

Data frame containing the y-outcome and x-variables.

dimreduce

Dimension reduction via variable importance weighted forests. FALSE means no dimension reduction; TRUE means removing 75% of the variables before binary rank conversion and then fitting a weighted forest; a numeric value x between 0 and 1 means removing the proportion x of the variables before binary rank conversion and then fitting a weighted forest.
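For illustration, the forms of dimreduce described above could be supplied as below (a sketch, assuming the tnbc data set shipped with the package and the column selection used in the Examples section):

data(tnbc)
## keep all variables: no dimension reduction
obj.full <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)], dimreduce = FALSE)
## remove 50% of the variables before binary rank conversion
obj.half <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)], dimreduce = 0.5)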

datrank

Logical indicating whether ranked raw data are used to fit the dimension reduction model.

distribution

Character string specifying the name of the distribution to use. If the response has only 2 unique values, bernoulli is assumed; otherwise, if the response is a factor, multinomial is assumed.

weights

An optional vector of weights to be used in the fitting process. The weights must be positive but need not be normalized.

ntree

Integer specifying the total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion; it matches n.trees in the gbm package.

nodedepth

Integer specifying the maximum depth of each tree. A value of 1 implies an additive model. This matches interaction.depth in the gbm package.

nodesize

Integer specifying the minimum number of observations in the terminal nodes of the trees; it matches n.minobsinnode in the gbm package. Note that this is the actual number of observations, not the total weight.

shrinkage

A shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. Values between 0.001 and 0.1 usually work, but a smaller learning rate typically requires more trees. Default is 0.05.

bag.fraction

The fraction of the training set observations randomly selected to propose the next tree in the expansion. This introduces randomness into the model fit. If bag.fraction < 1, running the same model twice will result in similar but different fits. gbm uses the R random number generator, so set.seed can ensure that the model can be reconstructed. Preferably, the user should save the returned gbm.object using save. Default is 0.5.
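Because bag.fraction < 1 makes the fit stochastic, a reproducible model can be obtained by setting the seed before each call (a sketch; the data and column selection follow the Examples section):

data(tnbc)
set.seed(2023)
obj.a <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
set.seed(2023)
obj.b <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
## both calls see identical bootstrap draws, so the two fits agree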

train.fraction

The first train.fraction * nrow(data) observations are used to fit the gbm and the remaining observations are used for computing out-of-sample estimates of the loss function.

cv.folds

Number of cross-validation folds to perform. If cv.folds > 1, then gbm, in addition to the usual fit, will perform cross-validation and calculate an estimate of the generalization error, returned in cv.error.

keep.data

Logical indicating whether to keep the data and an index of the data stored with the object. Keeping the data and index makes subsequent calls to gbm.more faster, at the cost of storing an extra copy of the dataset.

verbose

Logical indicating whether to print out progress and performance indicators. If this option is left unspecified for gbm.more, it uses the verbose setting from object. Default is TRUE.

class.stratify.cv

Logical indicating whether or not the cross-validation should be stratified by class. The purpose of stratifying the cross-validation is to help avoid situations in which training sets do not contain all classes.

n.cores

The number of CPU cores to use. The cross-validation loop will attempt to send different CV folds off to different cores. If n.cores is not specified by the user, it is guessed using the detectCores function in the parallel package. Note that the documentation for detectCores makes clear that it is not failsafe and could return a spurious number of available cores.

Value

fit

A vector containing the fitted values on the scale of the regression function (e.g., the log-odds scale for bernoulli).

train.error

A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the training data.

valid.error

A vector of length equal to the number of fitted trees containing the value of the loss function for each boosting iteration evaluated on the validation data.

cv.error

If cv.folds < 2, this component is NULL. Otherwise, it is a vector of length equal to the number of fitted trees containing a cross-validated estimate of the loss function for each boosting iteration.

oobag.improve

A vector of length equal to the number of fitted trees containing an out-of-bag estimate of the marginal reduction in the expected value of the loss function. The out-of-bag estimate uses only the training data and is useful for estimating the optimal number of boosting iterations. See gbm.perf.

cv.fitted

If cross-validation was performed, the cross-validated predicted values on the scale of the linear predictor; that is, the fitted values for the i-th CV fold from the model trained on the data in all other folds.
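As one way of using these components, the boosting iteration that minimizes the cross-validated loss can be read off cv.error (a sketch, assuming the components are accessible by name on the object fitted in the Examples section and that cv.folds > 1):

best.iter <- which.min(obj$cv.error)   ## iteration with the lowest CV loss
plot(obj$train.error, type = "l")      ## training loss per iteration
lines(obj$cv.error, lty = 2)           ## overlay the cross-validated loss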

Author(s)

Ruijie Yin (Maintainer, <ruijieyin428@gmail.com>), Chen Ye and Min Lu

References

Lu M., Yin R. and Chen X.S. (2023). Ensemble Methods of Rank-Based Trees for Single Sample Classification with Gene Expression Profiles.

Examples

data(tnbc)
obj <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)])
obj
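A slightly fuller call toggling several of the arguments documented above (a sketch; the argument values are illustrative only):

obj2 <- rboost(subtype ~ ., data = tnbc[, c(1:10, 337)],
               ntree = 200, shrinkage = 0.01, cv.folds = 3,
               verbose = FALSE)
obj2$cv.error[1:5]   ## cross-validated loss over the first iterations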

[Package ranktreeEnsemble version 0.22 Index]