wsrf {wsrf} | R Documentation |
Build a Forest of Weighted Subspace Decision Trees
Description
Build weighted subspace C4.5-based decision trees to construct a forest.
Usage
## S3 method for class 'formula'
wsrf(formula, data, ...)
## Default S3 method:
wsrf(x, y, mtry=floor(log2(length(x))+1), ntree=500,
weights=TRUE, parallel=TRUE, na.action=na.fail,
importance=FALSE, nodesize=2, clusterlogfile, ...)
Arguments
x, formula |
a data frame or a matrix of predictors, or a formula with a response but no interaction terms. |
y |
a response vector. |
data |
a data frame in which to interpret the variables named in the formula. |
ntree |
number of trees to grow. By default, 500. |
mtry |
number of variables to choose as candidates at each node split. By default, floor(log2(length(x))+1). |
weights |
logical. Whether to use weighted subspace selection (TRUE, the default) or not (FALSE). |
na.action |
a function indicating the behaviour when NA values are encountered in the data. By default, na.fail. |
parallel |
whether to run on multiple cores (TRUE), on multiple nodes, or sequentially (FALSE). |
importance |
should importance of predictors be assessed? |
nodesize |
minimum size of leaf node, i.e., minimum number of observations a leaf node represents. By default, 2. |
clusterlogfile |
character. The pathname of the log file when building model in a cluster. For debug. |
... |
optional parameters to be passed to the low level function
|
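As a rough illustration of the mtry default shown in Usage and of what the weights switch changes, the sketch below computes floor(log2(length(x))+1) for the iris predictors and then draws that many candidate variables, once uniformly and once with unequal probabilities. The weight values are made up for illustration; how wsrf actually derives variable weights is covered in the references, not here.

```r
# Predictors only: x is a data frame, so length(x) is its number of columns.
x <- iris[, names(iris) != "Species"]

# Default number of candidate variables per node split, as given in Usage.
mtry <- floor(log2(length(x)) + 1)
mtry  # 3 for the 4 iris predictors

set.seed(1)

# Random subspace: every predictor is equally likely to be drawn.
random_subspace <- sample(names(x), mtry)

# Weighted subspace: predictors with larger weights are drawn more often.
# These weights are hypothetical, not the ones wsrf computes.
w <- c(Sepal.Length = 0.1, Sepal.Width = 0.1,
       Petal.Length = 0.4, Petal.Width = 0.4)
weighted_subspace <- sample(names(x), mtry, prob = w[names(x)])
```

With only four predictors the two draws differ little; the distinction matters more as the number of predictors grows.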
Details
See Xu, Huang, Williams, Wang, and Ye (2012) for more details of the algorithm, and Zhao, Williams, Huang (2017) for more details of the package.
Currently, wsrf can only be used for classification. When weights=FALSE, wsrf grows C4.5-based trees (Quinlan (1993)), using a binary split for continuous predictors (variables) and a k-way split for categorical ones. For continuous predictors, each observed value is itself used as a candidate split point; no discretization is applied. The only stopping condition for splitting is that the size of a node must not be less than nodesize.
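The binary split on a continuous predictor can be sketched in base R. The code below is a simplified illustration of the gain ratio criterion from C4.5 (Quinlan (1993)), evaluated at every observed value of one predictor. The helper names entropy and gain_ratio are hypothetical, and this is not the package's internal code, which may differ in detail.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]              # drop empty classes to avoid log2(0)
  -sum(p * log2(p))
}

# Gain ratio of the binary split x <= threshold versus x > threshold.
gain_ratio <- function(x, y, threshold) {
  left <- x <= threshold
  frac <- c(mean(left), mean(!left))
  gain <- entropy(y) - frac[1] * entropy(y[left]) - frac[2] * entropy(y[!left])
  split_info <- -sum(frac * log2(frac))
  gain / split_info
}

x <- iris$Petal.Length
y <- iris$Species

# Every observed value except the largest is a candidate split point
# (splitting at the maximum would leave the right branch empty).
candidates <- head(sort(unique(x)), -1)
ratios <- vapply(candidates, function(t) gain_ratio(x, y, t), numeric(1))

# Candidate with the highest gain ratio.
best <- candidates[which.max(ratios)]
best
```

No binning is involved: the candidate set is simply the sorted unique values of the predictor, which is what "no discretization" amounts to in this sketch.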
Value
An object of class wsrf, which is a list with the following components:
confusion |
the confusion matrix of the prediction (based on OOB data). |
oob.times |
number of times cases are ‘out-of-bag’ (and thus used in computing the OOB error estimate). |
predicted |
the predicted values of the input data based on out-of-bag samples. |
useweights |
logical. Whether weighted subspace selection is used. NULL if the model was obtained by combining multiple wsrf models and one of them has a different value of 'useweights'. |
mtry |
integer. The number of variables to be chosen when splitting a node. |
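To give a feel for what oob.times counts, here is a minimal base R sketch of out-of-bag bookkeeping. It assumes the usual bagging scheme, in which each tree is built on a bootstrap sample drawn with replacement; that scheme is an assumption here, not something taken from the package source, and the sample size is hypothetical.

```r
set.seed(42)
n <- 150      # number of training cases (hypothetical)
ntree <- 500  # number of trees, matching the documented default

# For each tree, draw a bootstrap sample and mark the cases left out of it.
# The result is an n x ntree logical matrix.
oob <- replicate(ntree, {
  in_bag <- sample(n, n, replace = TRUE)
  !(seq_len(n) %in% in_bag)
})

# Number of trees for which each case is out-of-bag.
oob_times <- rowSums(oob)

# Average fraction of trees for which a case is out-of-bag;
# under this scheme it is close to exp(-1), about 0.37.
mean(oob_times) / ntree
```

Under this assumption each case is left out of roughly a third of the trees, and only those trees contribute to its out-of-bag prediction.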
Author(s)
He Zhao and Graham Williams (SIAT, CAS)
References
Xu, B. and Huang, J. Z. and Williams, G. J. and Wang, Q. and Ye, Y. 2012 "Classifying very high-dimensional data with random forests built from small subspaces". International Journal of Data Warehousing and Mining (IJDWM), 8(2), 44–63.
Quinlan, J. R. 1993 C4.5: Programs for Machine Learning. Morgan Kaufmann.
Zhao, H. and Williams, G. J. and Huang, J. Z. 2017 "wsrf: An R Package for Classification with Scalable Weighted Subspace Random Forests". Journal of Statistical Software, 77(3), 1–30. doi:10.18637/jss.v077.i03
Examples
library("wsrf")
# Prepare parameters.
ds <- iris
dim(ds)
names(ds)
target <- "Species"
vars <- names(ds)
if (sum(is.na(ds[vars]))) ds[vars] <- randomForest::na.roughfix(ds[vars])
ds[target] <- as.factor(ds[[target]])
(tt <- table(ds[target]))
form <- as.formula(paste(target, "~ ."))
set.seed(42)
train <- sample(nrow(ds), 0.7*nrow(ds))
test <- setdiff(seq_len(nrow(ds)), train)
# Build model. We disable parallelism here, since CRAN Repository
# Policy (https://cran.r-project.org/web/packages/policies.html)
# limits the usage of multiple cores to save the limited resource of
# the check farm.
model.wsrf <- wsrf(form, data=ds[train, vars], parallel=FALSE)
# View model.
print(model.wsrf)
print(model.wsrf, tree=1)
# Evaluate.
strength(model.wsrf)
correlation(model.wsrf)
res <- predict(model.wsrf, newdata=ds[test, vars], type=c("response", "waprob"))
actual <- ds[test, target]
(accuracy.wsrf <- mean(res$response==actual))
# A different type of prediction: take the class with the largest waprob value.
cl <- apply(res$waprob, 1, which.max)
cl <- factor(cl, levels=1:ncol(res$waprob), labels=levels(actual))
(accuracy2.wsrf <- mean(cl==actual))