divfor {diversityForest}    R Documentation

Construct a basic diversity forest prediction rule that uses univariable, binary splitting.

Description

Implements the most basic form of diversity forests that uses univariable, binary splitting. Currently, categorical, metric, and survival outcomes are supported.

Usage

divfor(
  formula = NULL,
  data = NULL,
  num.trees = 500,
  mtry = NULL,
  importance = "none",
  write.forest = TRUE,
  probability = FALSE,
  min.node.size = NULL,
  max.depth = NULL,
  replace = TRUE,
  sample.fraction = ifelse(replace, 1, 0.632),
  case.weights = NULL,
  class.weights = NULL,
  splitrule = NULL,
  num.random.splits = 1,
  alpha = 0.5,
  minprop = 0.1,
  split.select.weights = NULL,
  always.split.variables = NULL,
  respect.unordered.factors = NULL,
  scale.permutation.importance = FALSE,
  keep.inbag = FALSE,
  inbag = NULL,
  holdout = FALSE,
  quantreg = FALSE,
  oob.error = TRUE,
  num.threads = NULL,
  save.memory = FALSE,
  verbose = TRUE,
  seed = NULL,
  dependent.variable.name = NULL,
  status.variable.name = NULL,
  classification = NULL,
  nsplits = 30,
  proptry = 1
)

Arguments

formula

Object of class formula or character describing the model to fit. Interaction terms supported only for numerical variables.

data

Training data of class data.frame, matrix, dgCMatrix (Matrix) or gwaa.data (GenABEL).

num.trees

Number of trees. Default is 500.

mtry

Artefact from 'ranger'. NOT needed for diversity forests.

importance

Variable importance mode, one of 'none', 'impurity', 'impurity_corrected', 'permutation'. The 'impurity' measure is the Gini index for classification, the variance of the responses for regression and the sum of test statistics (see splitrule) for survival. NOTE: Currently, only "permutation" (and "none") work for diversity forests.

write.forest

Save divfor.forest object, required for prediction. Set to FALSE to reduce memory usage if no prediction intended.

probability

Grow a probability forest as in Malley et al. (2012). NOTE: Not yet implemented for diversity forests!

min.node.size

Minimal node size. Default 1 for classification, 5 for regression, 3 for survival, and 5 for probability.

max.depth

Maximal tree depth. A value of NULL or 0 (the default) corresponds to unlimited depth, 1 to tree stumps (1 split per tree).

replace

Sample with replacement.

sample.fraction

Fraction of observations to sample. Default is 1 for sampling with replacement and 0.632 for sampling without replacement. For classification, this can be a vector of class-specific values.

case.weights

Weights for sampling of training observations. Observations with larger weights will be selected with higher probability in the bootstrap (or subsampled) samples for the trees.

class.weights

Weights for the outcome classes (in order of the factor levels) in the splitting rule (cost sensitive learning). Classification and probability prediction only. For classification the weights are also applied in the majority vote in terminal nodes.

splitrule

Splitting rule. For classification and probability estimation "gini" or "extratrees" with default "gini". For regression "variance", "extratrees" or "maxstat" with default "variance". For survival "logrank", "extratrees", "C" or "maxstat" with default "logrank". NOTE: For diversity forests currently only the default splitting rules are supported.

num.random.splits

Artefact from 'ranger'. NOT needed for diversity forests.

alpha

For "maxstat" splitrule: Significance threshold to allow splitting. NOT needed for diversity forests.

minprop

For "maxstat" splitrule: Lower quantile of covariate distribution to be considered for splitting. NOT needed for diversity forests.

split.select.weights

Numeric vector with weights between 0 and 1, representing the probability to select variables for splitting. Alternatively, a list of size num.trees, containing split select weight vectors for each tree can be used.

always.split.variables

Currently not usable. Character vector with variable names to be always selected.

respect.unordered.factors

Handling of unordered factor covariates. One of 'ignore' and 'order'; the 'partition' option available in 'ranger' is not (yet) supported for diversity forests. Default is 'ignore'. Alternatively, TRUE (= 'order') or FALSE (= 'ignore') can be used.

scale.permutation.importance

Scale permutation importance by standard error as in Breiman (2001). Only applicable if the permutation variable importance mode is selected.

keep.inbag

Save how often observations are in-bag in each tree.

inbag

Manually set observations per tree. List of size num.trees, containing inbag counts for each observation. Can be used for stratified sampling.
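
As a rough illustration of the list structure described above, the following base R sketch builds in-bag counts for a class-stratified bootstrap. The helper name make_stratified_inbag is hypothetical and not part of the package; the sketch only assumes that each list element holds one count per observation, as stated in the description.

```r
# Hypothetical helper (not part of diversityForest): in-bag counts for a
# class-stratified bootstrap. Returns a list of length num.trees; each
# element is an integer vector with one in-bag count per observation.
make_stratified_inbag <- function(y, num.trees) {
  n <- length(y)
  lapply(seq_len(num.trees), function(b) {
    counts <- integer(n)
    for (idx in split(seq_len(n), y)) {
      # Bootstrap within each class, so class sizes are preserved.
      pos <- sample.int(length(idx), replace = TRUE)
      counts[idx] <- tabulate(pos, nbins = length(idx))
    }
    counts
  })
}

set.seed(1)
inbag <- make_stratified_inbag(iris$Species, num.trees = 20)
length(inbag)     # one element per tree
sum(inbag[[1]])   # total in-bag count equals nrow(iris)
```

A list built this way could then be supplied as inbag = inbag together with a matching num.trees value.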

holdout

Hold-out mode. Hold-out all samples with case weight 0 and use these for variable importance and prediction error.

quantreg

Prepare quantile prediction as in quantile regression forests (Meinshausen 2006). Regression only. Set keep.inbag = TRUE to prepare out-of-bag quantile prediction.

oob.error

Compute OOB prediction error. Set to FALSE to save computation time, e.g. for large survival forests.

num.threads

Number of threads. Default is number of CPUs available.

save.memory

Use memory saving (but slower) splitting mode. No effect for survival and GWAS data. Warning: This option slows down tree growing; use it only if you encounter memory problems. NOT needed for diversity forests.

verbose

Show computation status and estimated runtime.

seed

Random seed. Default is NULL, which generates the seed from R. Set to 0 to ignore the R seed.

dependent.variable.name

Name of outcome variable, needed if no formula given. For survival forests this is the time variable.

status.variable.name

Name of status variable, only applicable to survival data and needed if no formula given. Use 1 for event and 0 for censoring.

classification

Only needed if data is a matrix. Set to TRUE to grow a classification forest.

nsplits

Number of candidate splits to sample for each split. Default is 30.

proptry

Parameter that restricts the number of candidate splits considered for small nodes. If nsplits is larger than proptry times the number of all possible splits, the number of candidate splits to draw is reduced to the largest integer smaller than proptry times the number of all possible splits. Default is 1, which corresponds to always using nsplits candidate splits.
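
The capping rule described for proptry can be sketched in a few lines of base R. The helper effective_nsplits and its argument n.possible (the number of all possible splits in a node) are hypothetical names used only for illustration, and reading "largest integer smaller than" as floor() is an assumption about the implementation.

```r
# Hypothetical helper (not part of diversityForest): number of candidate
# splits drawn for a node, following the rule described for 'proptry'.
effective_nsplits <- function(nsplits, proptry, n.possible) {
  limit <- proptry * n.possible
  if (nsplits > limit) {
    # Assumption: "largest integer smaller than" is implemented as floor().
    floor(limit)
  } else {
    nsplits
  }
}

effective_nsplits(nsplits = 30, proptry = 1,   n.possible = 100)  # 30
effective_nsplits(nsplits = 30, proptry = 0.5, n.possible = 25)   # 12
```

Under this reading, smaller values of proptry mainly affect small nodes, where few distinct splits are possible.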

Value

Object of class divfor with elements

forest

Saved forest (if write.forest is set to TRUE). Note that the variable IDs in the split.varIDs object do not necessarily represent the column number in R.

predictions

Predicted classes/values, based on out-of-bag samples (classification and regression only).

variable.importance

Variable importance for each independent variable.

prediction.error

Overall out-of-bag prediction error. For classification this is the fraction of misclassified samples, for probability estimation the Brier score, for regression the mean squared error, and for survival one minus Harrell's C-index.
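
For the classification case, the fraction of misclassified samples can be read off a contingency table of true and predicted classes. The toy vectors below are made up for illustration and are not output of divfor.

```r
# Toy true classes and made-up predictions (not produced by divfor).
truth <- factor(c("a", "a", "b", "b", "b", "c"))
pred  <- factor(c("a", "b", "b", "b", "c", "c"), levels = levels(truth))

conf <- table(truth, pred)               # rows: true, columns: predicted
err  <- 1 - sum(diag(conf)) / sum(conf)  # fraction of misclassified samples
err                                      # 2 of 6 samples are misclassified
```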

r.squared

R squared. Also called explained variance or coefficient of determination (regression only). Computed on out-of-bag data.

confusion.matrix

Contingency table for classes and predictions based on out-of-bag samples (classification only).

unique.death.times

Unique death times (survival only).

chf

Estimated cumulative hazard function for each sample (survival only).

survival

Estimated survival function for each sample (survival only).

call

Function call.

num.trees

Number of trees.

num.independent.variables

Number of independent variables.

min.node.size

Value of minimal node size used.

treetype

Type of forest/tree: classification, regression, or survival.

importance.mode

Importance mode used.

num.samples

Number of samples.

splitrule

Splitting rule.

replace

Sample with replacement.

nsplits

Value of nsplits used.

proptry

Value of proptry used.

Author(s)

Roman Hornung, Marvin N. Wright

References

See Also

predict.divfor

Examples

## Not run: 

## Load package:
library("diversityForest")

## Set seed to obtain reproducible results:
set.seed(1234)

## Diversity forest with default settings (NOT recommended)
# Classification:
divfor(Species ~ ., data = iris, num.trees = 20)
# Regression:
iris2 <- iris
iris2$Species <- NULL
iris2$Y <- rnorm(nrow(iris2))
divfor(Y ~ ., data = iris2, num.trees = 20)
# Survival:
library("survival")
divfor(Surv(time, status) ~ ., data = veteran, num.trees = 20, respect.unordered.factors = "order")
# NOTE: num.trees = 20 is too small for practical
# purposes - the prediction performance of the resulting
# forest will be suboptimal.
# In practice, use num.trees = 500 (the default value) or a
# larger number.

## Diversity forest with specified values for nsplits and proptry (NOT recommended)
divfor(Species ~ ., data = iris, nsplits = 10, proptry = 0.4, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.

## Applying diversity forest after optimizing the values of nsplits and proptry (recommended)
tuneres <- tunedivfor(formula = Species ~ ., data = iris, num.trees.pre = 20)
# NOTE: num.trees.pre = 20 is too small for practical
# purposes - the out-of-bag error estimates of the forests
# constructed during optimization will be much too variable.
# In practice, use num.trees.pre = 500 (the default value) or a
# larger number.
divfor(Species ~ ., data = iris, nsplits = tuneres$nsplitsopt, 
  proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.

## Prediction
train.idx <- sample(nrow(iris), 2/3 * nrow(iris))
iris.train <- iris[train.idx, ]
iris.test <- iris[-train.idx, ]
tuneres <- tunedivfor(formula = Species ~ ., data = iris.train, num.trees.pre = 20)
# NOTE again: num.trees.pre = 20 is too small for practical purposes.
rg.iris <- divfor(Species ~ ., data = iris.train, nsplits = tuneres$nsplitsopt, 
  proptry = tuneres$proptryopt, num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.
pred.iris <- predict(rg.iris, data = iris.test)
table(iris.test$Species, pred.iris$predictions)

## Variable importance
rg.iris <- divfor(Species ~ ., data = iris, importance = "permutation", num.trees = 20)
# NOTE again: num.trees = 20 is too small for practical purposes.
rg.iris$variable.importance

## End(Not run)


[Package diversityForest version 0.4.0 Index]