rare_level_sampler {pre}    R Documentation

Dealing with rare factor levels in fitting prediction rule ensembles.

Description

Provides a sampling function to be supplied to the sampfrac argument of function pre, making sure that each level of the specified factor(s) is present in each sample.

Usage

rare_level_sampler(factors, data, sampfrac = 0.5, warning = FALSE)

Arguments

factors

Character vector with name(s) of factors with rare levels.

data

data.frame containing the variables in the model. Response must be of class factor for classification, numeric for (count) regression, Surv for survival regression. Input variables must be of class numeric, factor or ordered factor. Otherwise, pre will attempt to recode.

sampfrac

numeric value > 0 and ≤ 1. Specifies the fraction of randomly selected training observations used to produce each tree. Values < 1 will result in sampling without replacement (i.e., subsampling), a value of 1 will result in sampling with replacement (i.e., bootstrap sampling). Alternatively, a sampling function may be supplied, which should take arguments n (sample size) and weights.

warning

logical. Whether a warning should be printed if observations with rare factor levels are added to the training sample of the current iteration.

Details

Categorical predictor variables (factors) with rare levels may be problematic in boosting algorithms employing sampling (which is employed by default in function pre).

If the sample in a given boosting iteration contains no observations with a given (rare) level of a factor, while that level is present in the full training dataset and the factor is selected for splitting in the tree, then no prediction can be generated for that level, resulting in an error. Note that boosting methods other than pre that also employ sampling (e.g., gbm or xgboost) may not generate an error in such cases, but they also do not document how intermediate predictions are generated. It is likely that these methods use one-hot encoding of factors, which, from the perspective of model interpretation, introduces new problems, especially when the aim is to obtain a sparse set of rules, as in pre.
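To give a sense of how often a subsample can miss a rare level, the following base R sketch draws repeated half-samples from a made-up factor in which level "c" occurs only twice among 100 observations. The data, the number of draws and the object names are illustrative assumptions and are not taken from package pre. For this toy setup, the exact probability that a half-sample contains no "c" is choose(98, 50) / choose(100, 50), roughly one in four.

```r
## Toy factor (hypothetical data): level "c" occurs only twice
set.seed(1)
f <- factor(c(rep("a", 60), rep("b", 38), rep("c", 2)))

## Draw subsamples without replacement, mimicking sampfrac = 0.5
n_draws <- 1000
missing_rare <- replicate(n_draws, {
  idx <- sample(length(f), size = 50, replace = FALSE)
  !("c" %in% f[idx])  # TRUE if the rare level is absent from this sample
})

## Simulated proportion of subsamples lacking level "c"
prop_missing <- mean(missing_rare)

## Exact (hypergeometric) probability for comparison
p_exact <- choose(98, 50) / choose(100, 50)

print(c(simulated = prop_missing, exact = p_exact))
```

With several hundred trees, as is common when fitting an ensemble, at least one such sample is therefore very likely to occur for a factor this unbalanced.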

With function pre(), the rare-factor-level issue, if encountered, can be dealt with by the user in one of the following ways (in no particular order):

Value

A sampling function, which generates sub- or bootstrap samples as usual in function pre, but checks whether all levels of the specified factor(s) are present and, if not, adds observations with the missing levels. If warning = TRUE, a warning is issued whenever observations are added.
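The behaviour described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation used in package pre: the helper name make_level_sampler is hypothetical, n is assumed to equal nrow(data), weights are assumed to be strictly positive, and exactly one observation is added per missing level.

```r
## Hypothetical helper, NOT the code used by rare_level_sampler() in pre
make_level_sampler <- function(factors, data, sampfrac = 0.5, warning = FALSE) {
  function(n, weights = rep(1, times = n)) {
    ## Subsample without replacement if sampfrac < 1, bootstrap otherwise
    size <- if (sampfrac < 1) round(sampfrac * n) else n
    idx <- sample(seq_len(n), size = size, replace = sampfrac >= 1,
                  prob = weights)
    for (fac in factors) {
      x <- droplevels(as.factor(data[[fac]]))
      ## Levels observed in the full data but absent from the current sample
      absent <- setdiff(levels(x), as.character(x[idx]))
      for (lev in absent) {
        candidates <- which(x == lev)
        ## Add one randomly chosen observation that has the missing level
        idx <- c(idx, candidates[sample.int(length(candidates), 1)])
        if (warning) {
          warning("Added an observation with level '", lev,
                  "' of factor '", fac, "'.", call. = FALSE)
        }
      }
    }
    idx
  }
}

## Usage on a copy of iris in which "versicolor" is rare (3 of 103 rows)
dat <- iris[c(1:50, 51:53, 101:150), ]
samp <- make_level_sampler("Species", data = dat, sampfrac = 0.5)
set.seed(2)
idx <- samp(n = nrow(dat), weights = rep(1, times = nrow(dat)))
table(dat$Species[idx])
```

Adding a single observation per missing level keeps the sample size close to the requested fraction; other choices, such as adding all observations with the missing level, would be equally consistent with the description above.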

See Also

pre

Examples

## Create dataset with two factors containing rare levels
dat <- iris[iris$Species != "versicolor", ]
dat <- rbind(dat, iris[iris$Species == "versicolor", ][1:5, ])
dat$factor2 <- factor(rep(1:21, times = 5))

## Set up sampling function
samp_func <- rare_level_sampler(c("Species", "factor2"), data = dat,
                                sampfrac = .51, warning = TRUE)

## Illustrate what it does
N <- nrow(dat)
wts <- rep(1, times = nrow(dat))
set.seed(3)
dat[samp_func(n = N, weights = wts), ] # single sample
for (i in 1:500) dat[samp_func(n = N, weights = wts), ]
warnings() # to illustrate warnings that may occur when fitting a full PRE

## Illustrate use with function pre:
## (Note: low ntrees value merely to reduce computation time for the example)
set.seed(42)
# iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20) # would yield error
iris.ens <- pre(Petal.Width ~ . , data = dat, ntrees = 20, 
  sampfrac = samp_func) # should work

[Package pre version 1.0.7 Index]