R: Build a Forest of Weighted Subspace Decision Trees

wsrf {wsrf}

R Documentation

Build a Forest of Weighted Subspace Decision Trees

Description

Build weighted subspace C4.5-based decision trees to construct a forest.

Usage


## S3 method for class 'formula'
wsrf(formula, data, ...)
## Default S3 method:
wsrf(x, y, mtry=floor(log2(length(x))+1), ntree=500,
                       weights=TRUE, parallel=TRUE, na.action=na.fail,
                       importance=FALSE, nodesize=2, clusterlogfile, ...)

Arguments

`x`, `formula`	a data frame or a matrix of predictors, or a formula with a response but no interaction terms.
`y`	a response vector.
`data`	a data frame in which to interpret the variables named in the formula.
`ntree`	number of trees to grow. By default, 500
`mtry`	number of variables to choose as candidates at each node split, by default, `floor(log2(length(x))+1)`.
`weights`	logical. `TRUE` for weighted subspace selection, which is the default; `FALSE` for random selection, and the trees are based on C4.5.
`na.action`	a function indicate the behaviour when encountering NA values in `data`. By default, `na.fail`. If `NULL`, do nothing.
`parallel`	whether to run multiple cores (TRUE), nodes, or sequentially (FALSE).
`importance`	should importance of predictors be assessed?
`nodesize`	minimum size of leaf node, i.e., minimum number of observations a leaf node represents. By default, 2.
`clusterlogfile`	character. The pathname of the log file when building model in a cluster. For debug.
`...`	optional parameters to be passed to the low level function `wsrf.default`.

Details

See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm, and Zhao, Williams, Huang (2017) for more details of the package.

Currently, wsrf can only be used for classification. When weights=FALSE, C4.5-based trees (Quinlan (1993)) are grown by wsrf, where binary split is used for continuous predictors (variables) and k-way split for categorical ones. For continuous predictors, each of the values themselves is used as split points, no discretization used. The only stopping condition for split is the minimum node size must not less than nodesize.

Value

An object of class wsrf, which is a list with the following components:

`confusion`	the confusion matrix of the prediction (based on OOB data).
`oob.times`	number of times cases are ‘out-of-bag’ (and thus used in computing OOB error estimate)
`predicted`	the predicted values of the input data based on out-of-bag samples.
`useweights`	logical. Whether weighted subspace selection is used? NULL if the model is obtained by combining multiple wsrf model and one of them has different value of 'useweights'.
`mtry`	integer. The number of variables to be chosen when splitting a node.

Author(s)

He Zhao and Graham Williams (SIAT, CAS)

References

Xu, B. and Huang, J. Z. and Williams, G. J. and Wang, Q. and Ye, Y. 2012 "Classifying very high-dimensional data with random forests built from small subspaces". International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44–63.

Quinlan, J. R. 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann.

Zhao, H. and Williams, G. J. and Huang, J. Z. 2017 "wsrf: An R Package for Classification with Scalable Weighted Subspace Random Forests". Journal of Statistical Software, 77(3), 1–30. doi:10.18637/jss.v077.i03

Examples

  library("wsrf")

  # Prepare parameters.
  ds <- iris
  dim(ds)
  names(ds)
  target <- "Species"
  vars   <- names(ds)
  if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
  ds[target] <- as.factor(ds[[target]])
  (tt  <- table(ds[target]))
  form <- as.formula(paste(target, "~ ."))
  set.seed(42)
  train <- sample(nrow(ds), 0.7*nrow(ds))
  test  <- setdiff(seq_len(nrow(ds)), train)

  # Build model.  We disable parallelism here, since CRAN Repository
  # Policy (https://cran.r-project.org/web/packages/policies.html)
  # limits the usage of multiple cores to save the limited resource of
  # the check farm.

  model.wsrf <- wsrf(form, data=ds[train, vars], parallel=FALSE)
  
  # View model.
  print(model.wsrf)
  print(model.wsrf, tree=1)

  # Evaluate.
  strength(model.wsrf)
  correlation(model.wsrf)
  res <- predict(model.wsrf, newdata=ds[test, vars], type=c("response", "waprob"))
  actual <- ds[test, target]
  (accuracy.wsrf <- mean(res$response==actual))
  
  # Different type of prediction.
  cl <- apply(res$waprob, 1, which.max)
  cl <- factor(cl, levels=1:ncol(res$waprob), labels=levels(actual))
  (accuracy2.wsrf <- mean(cl==actual))

[Package wsrf version 1.7.30 Index]