stratified_rf {StratifiedRF} | R Documentation |
Stratified Random Forest
Description
Random Forest that works with groups of predictor variables. When building each tree, a set number of variables is sampled from each group separately. This is useful when rows combine information about different entities (e.g. user information and product information) and it is not sensible to make a prediction from only one group of variables, or when one group has far more variables than another and the groups should appear evenly across trees.
Usage
stratified_rf(df, targetvar, groups, mtry = "auto", ntrees = 500,
multicore = TRUE, class_quotas = NULL, sample_weights = NULL,
fulldepth = TRUE, replacement = TRUE, c50_control = NULL,
na.action = na.pass, drop_threshold = NULL)
Arguments
df |
Data to build the model (data.frame only). |
targetvar |
String indicating the name of the target or outcome variable in the data. Character types will be coerced to factors. |
groups |
Unnamed list, containing at each entry a group of variables (as a string vector with their names). |
mtry |
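A numeric vector indicating how many variables to take from each group when building each tree. If set to "auto" then, for each group, mtry = round(sqrt(m_total) * m_group / m_total), where m_total is the total number of variables across all groups and m_group is the number of variables in that group (with a minimum of 1 for each group). |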
ntrees |
Number of trees to grow. When multicore=TRUE, the number of trees should be a multiple of the number of cores; otherwise it is rounded down to the nearest multiple. |
multicore |
Whether to use multiple CPU cores to parallelize the construction of trees. Parallelization is done with the 'parallel' library's default settings. |
class_quotas |
How many rows from each class to use in each tree (useful when there is a class imbalance). Must be a numeric vector or a named list with the number of desired rows to sample for each level of the target variable. Ignored when sample_weights is passed. Note that using more rows than the data originally had might result in incorrect out-of-bag error estimates. |
sample_weights |
Probability of sampling each row when building a tree. Must be a numeric vector. If not defined, then all rows have the same probability. Note that, depending on the structure of the data, setting this might result in incorrect out-of-bag error estimates. |
fulldepth |
Whether to grow the trees to full depth. Ignored when passing c50_control. |
replacement |
Whether to sample rows with replacement. |
c50_control |
Custom parameters for growing trees. Must be a C5.0Control object compatible with the 'C50' package. |
na.action |
A function indicating how to handle NAs. Default is to include missing values when building a tree (see 'C50' documentation). |
drop_threshold |
Drop a tree whenever its resulting out-of-bag classification accuracy falls below a certain threshold specified here. Must be a number between 0 and 1. |
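The row sampling described under class_quotas can be pictured with the base-R sketch below. It is an illustration only: sample_by_quota is a hypothetical helper, not a function of this package, and the package's internal sampling may differ. It also shows which rows remain out-of-bag for one tree, which is why large quotas can leave few rows for the out-of-bag estimate.

```r
# Hypothetical helper (not part of StratifiedRF): draw row indices per class.
sample_by_quota <- function(y, quotas, replacement = TRUE) {
  y <- as.factor(y)
  idx <- lapply(names(quotas), function(lev) {
    rows <- which(y == lev)
    # Sample positions rather than values so a one-row class is handled safely
    rows[sample.int(length(rows), size = quotas[[lev]], replace = replacement)]
  })
  unlist(idx)
}

data(iris)
set.seed(1)
# Assumed quotas: oversample one class relative to the others
quotas <- c(setosa = 20, versicolor = 20, virginica = 40)
in_bag <- sample_by_quota(iris$Species, quotas)

# Rows never drawn for this tree are its out-of-bag rows
oob <- setdiff(seq_len(nrow(iris)), in_bag)

table(iris$Species[in_bag])
```

The same helper accepts a named list of quotas, since both vectors and lists support lookup by level name.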
Details
Note that while this algorithm forces each tree to consider possible splits with variables from all groups, it doesn't guarantee that they will end up having splits with variables from different groups.
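As a minimal sketch of the per-tree variable selection, the base-R code below draws mtry[i] variable names from the i-th group. The helper sample_group_vars is hypothetical and only illustrates the idea; it is not the package's internal code.

```r
# Hypothetical helper (not part of StratifiedRF):
# pick mtry[i] variable names, without replacement, from groups[[i]].
sample_group_vars <- function(groups, mtry) {
  stopifnot(length(groups) == length(mtry))
  picked <- Map(
    function(vars, k) vars[sample.int(length(vars), size = k)],
    groups, mtry
  )
  unlist(picked)
}

groups <- list(c("Sepal.Length", "Sepal.Width"),
               c("Petal.Length", "Petal.Width"))
set.seed(42)
tree_vars <- sample_group_vars(groups, mtry = c(1, 1))
tree_vars  # one sepal variable and one petal variable
```

Every tree is offered candidates from both groups, but the tree-growing step is still free to split only on the variables it finds informative.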
The original Random Forest algorithm recommends sampling sqrt(n_features) variables in total, but this might not work well when the groups contain very different numbers of variables.
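The following sketch works through the "auto" rule given for the mtry argument, read as a proportional share of sqrt(m_total) with a floor of 1 per group. The function auto_mtry and the variable names are assumptions made for illustration; the package may compute this differently internally.

```r
# Hypothetical re-implementation of the documented "auto" rule.
auto_mtry <- function(groups) {
  m_group <- lengths(groups)   # number of variables in each group
  m_total <- sum(m_group)      # number of variables overall
  pmax(1, round(sqrt(m_total) * m_group / m_total))
}

# Two balanced groups, as in the iris example
balanced <- list(c("Sepal.Length", "Sepal.Width"),
                 c("Petal.Length", "Petal.Width"))
auto_mtry(balanced)    # 1 1

# Made-up unbalanced groups: 3 user variables against 97 product variables
unbalanced <- list(paste0("user_", 1:3), paste0("product_", 1:97))
auto_mtry(unbalanced)  # 1 10
```

In the unbalanced case the small group would receive a share of 0.3 variables, so the minimum of 1 guarantees that it is represented in every tree.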
Everything outside the tree-building is implemented in native R code and might therefore be slow. Trees are grown using the C5.0 algorithm from the 'C50' library, so the model can be used for classification only (not for regression). Refer to the 'C50' library documentation for details of the tree-building algorithm.
See Also
'C50' library: https://cran.r-project.org/package=C50
Examples
data(iris)
groups <- list(c("Sepal.Length","Sepal.Width"),c("Petal.Length","Petal.Width"))
mtry <- c(1,1)
m <- stratified_rf(iris,"Species",groups,mtry,ntrees=2,multicore=FALSE)
summary(m)