rare_level_sampler {pre} | R Documentation |
Dealing with rare factor levels in fitting prediction rule ensembles.
Description
Provides a sampling function to be supplied to the sampfrac argument of function pre,
making sure that each level of the specified factor(s) is present in each sample.
Usage
rare_level_sampler(factors, data, sampfrac = 0.5, warning = FALSE)
Arguments
factors |
Character vector with name(s) of factors with rare levels. |
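data |
data.frame containing the training data, including the factor(s) named in factors. |
sampfrac |
numeric value. Fraction of training observations to sample in each boosting iteration; defaults to 0.5. |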
warning |
logical. Whether a warning should be printed if observations with rare factor levels are added to the training sample of the current iteration. |
Details
Categorical predictor variables (factors) with rare levels may be problematic
in boosting algorithms that employ sampling (which function pre does by default).
If the sample in a given boosting iteration contains no observations with a given
(rare) level of a factor, while this level is present in the full training dataset, and
the factor is selected for splitting in the tree, then no prediction can be generated
for that level of the factor, resulting in an error. Note that boosting methods other
than pre that also employ sampling (e.g., gbm or xgboost) may not generate an error in
such cases, but also do not document how intermediate predictions are generated. It is
likely that these methods use one-hot encoding of factors, which from a perspective of
model interpretation introduces new problems, especially when the aim is to obtain a
sparse set of rules, as in pre.
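To give a feel for how often a sample can miss a rare level, the following minimal sketch uses base R only. It mimics subsampling without replacement on a dataset where one level of Species occurs in 5 of 105 rows. The 50% sampling fraction, the number of draws and the object names are assumptions chosen for illustration; nothing here calls function pre.

```r
## Dataset in which 'versicolor' is a rare level (5 of 105 rows)
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])

## Subsample size for an (assumed) sampling fraction of 0.5
size <- floor(0.5 * nrow(dat))

## Draw many subsamples without replacement and record whether the
## rare level is absent from each of them
set.seed(1)
n_draws <- 1000
missing_rare <- logical(n_draws)
for (i in seq_len(n_draws)) {
  idx <- sample(nrow(dat), size = size)
  missing_rare[i] <- !("versicolor" %in% dat$Species[idx])
}

## Simulated proportion of subsamples lacking the rare level
p_sim <- mean(missing_rare)

## Exact probability of drawing none of the 5 rare observations
## (hypergeometric distribution)
p_exact <- dhyper(0, m = 5, n = nrow(dat) - 5, k = size)

c(simulated = p_sim, exact = p_exact)
```

Even a small per-sample probability matters in practice, because a new sample is drawn in every boosting iteration.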
With function pre(), the rare-factor-level issue, if encountered, can be dealt with by the user
in one of the following ways (in no particular order):
Use a sampling function that guarantees inclusion of rare factor levels in each sample. E.g., use rare_level_sampler, which yields a sampling function that creates training samples guaranteed to include each level of the specified factor(s). Advantage: No loss of information, easy to implement, guaranteed to solve the issue. Disadvantage: May result in oversampling of observations with rare factor levels, potentially biasing results. The bias is likely small, though, and will be larger for smaller sample sizes and sampling fractions, and for larger numbers of rare levels. The latter will also increase computational demands.
Specify learnrate = 0. This results in a (su)bagging instead of a boosting approach. Advantage: Eliminates the rare-factor-level issue completely, because intermediate predictions need not be computed. Disadvantage: Boosting with a low learning rate often improves predictive accuracy.
Data pre-processing: Before running function pre(), combine rare factor levels with other levels of the factors. Advantage: Limited loss of information. Disadvantage: Likely, but not guaranteed, to solve the issue.
Data pre-processing: Apply one-hot encoding to the predictor matrix before applying function pre(). This can easily be done by applying function model.matrix. Advantage: Guaranteed to solve the error, easy to implement. Disadvantage: One-hot encoding increases the number of predictor variables, which may reduce interpretability and, probably to a lesser extent, accuracy.
Data pre-processing: Remove observations with rare factor levels from the dataset before running function pre(). Advantage: Guaranteed to solve the error. Disadvantage: Removing observations results in a loss of information, and may bias the results.
Increase the value of the sampfrac argument of function pre(). Advantage: Easy to implement. Disadvantage: Larger samples are more likely, but not guaranteed, to contain all possible factor levels, so this is not guaranteed to solve the issue.
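The three data pre-processing options listed above can be sketched in base R as follows. This is an illustration under assumptions, not a recommendation: the choice of which levels to merge, the merged level name "versi_or_virg" and the object names are all made up for the example.

```r
## Dataset in which 'versicolor' is a rare level (5 of 105 rows)
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])

## (a) Combine the rare level with another level: assigning the same
##     name to several levels merges them into one
dat_merged <- dat
to_merge <- levels(dat_merged$Species) %in% c("versicolor", "virginica")
levels(dat_merged$Species)[to_merge] <- "versi_or_virg"
table(dat_merged$Species)

## (b) One-hot encode the factor with model.matrix(); '- 1' drops the
##     intercept so that every level gets its own 0/1 column
dummies <- model.matrix(~ Species - 1, data = dat)
dat_onehot <- cbind(dat[, names(dat) != "Species"], dummies)
head(dat_onehot)

## (c) Remove observations with the rare level, then drop the now
##     unused level from the factor
dat_removed <- droplevels(dat[dat$Species != "versicolor", ])
table(dat_removed$Species)
```

Any of the resulting data frames could then be passed to pre() in place of the original data. The two remaining options (learnrate = 0 and a larger sampfrac) require no pre-processing, as they are set through arguments of pre() itself.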
Value
A sampling function, which generates sub- or bootstrap samples as usual in function pre,
but checks whether all levels of the specified factor(s) are present and, if not, adds
observations with the missing levels. If warning = TRUE, a warning is issued when
observations are added.
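The behavior described above can be approximated with a short closure. The sketch below is a hypothetical stand-in and not the implementation used in package pre: the name make_level_sampler, the argument warn, the rule "subsample if sampfrac < 1, bootstrap otherwise" and the choice to add exactly one observation per missing level are all assumptions.

```r
## Hypothetical helper (NOT the code of rare_level_sampler in 'pre')
make_level_sampler <- function(factors, data, sampfrac = 0.5, warn = FALSE) {
  function(n, weights) {
    ## n is assumed to equal nrow(data)
    if (sampfrac < 1) {
      ## Subsampling without replacement (assumed behavior)
      idx <- sample(seq_len(n), size = round(sampfrac * n),
                    replace = FALSE, prob = weights)
    } else {
      ## Bootstrap sampling with replacement (assumed behavior)
      idx <- sample(seq_len(n), size = n, replace = TRUE, prob = weights)
    }
    for (f in factors) {
      ## Levels observed in the full data but absent from the sample
      lev_all <- levels(droplevels(factor(data[[f]])))
      lev_in <- unique(as.character(data[[f]][idx]))
      for (lev in setdiff(lev_all, lev_in)) {
        candidates <- which(data[[f]] == lev)
        ## Add one randomly chosen observation with the missing level
        idx <- c(idx, candidates[sample.int(length(candidates), 1)])
        if (warn) {
          warning("Added an observation with level '", lev,
                  "' of factor '", f, "' to the sample.")
        }
      }
    }
    idx
  }
}

## Usage on a dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))
sampler <- make_level_sampler(c("Species", "factor2"), data = dat)
set.seed(1)
idx <- sampler(n = nrow(dat), weights = rep(1, nrow(dat)))
table(dat$Species[idx])
```

Because observations are appended after sampling, the returned sample can be slightly larger than sampfrac * n, which is the oversampling effect mentioned in the Details section.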
See Also
pre
Examples
## Create dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))
## Set up sampling function
samp_func <- rare_level_sampler(c("Species", "factor2"), data = dat,
                                sampfrac = .51, warning = TRUE)
## Illustrate what it does
N <- nrow(dat)
wts <- rep(1, times = nrow(dat))
set.seed(3)
dat[samp_func(n = N, weights = wts), ] # single sample
for (i in 1:500) dat[samp_func(n = N, weights = wts), ]
warnings() # to illustrate warnings that may occur when fitting a full PRE
## Illustrate use with function pre:
## (Note: low ntrees value merely to reduce computation time for the example)
set.seed(42)
# iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20) # would yield error
iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20,
                sampfrac = samp_func) # should work