pre {pre} | R Documentation |
Derive a prediction rule ensemble
Description
Function pre
derives a sparse ensemble of rules and/or linear functions for
prediction of a continuous, binary, count, multinomial, multivariate
continuous or survival response.
Usage
pre(
formula,
data,
family = gaussian,
ad.alpha = NA,
ad.penalty = "lambda.min",
use.grad = TRUE,
weights,
type = "both",
sampfrac = 0.5,
maxdepth = 3L,
learnrate = 0.01,
mtry = Inf,
ntrees = 500,
confirmatory = NULL,
singleconditions = FALSE,
winsfrac = 0.025,
normalize = TRUE,
standardize = FALSE,
ordinal = TRUE,
nfolds = 10L,
tree.control,
tree.unbiased = TRUE,
removecomplements = TRUE,
removeduplicates = TRUE,
verbose = FALSE,
par.init = FALSE,
par.final = FALSE,
sparse = FALSE,
...
)
Arguments
formula |
a symbolic description of the model to be fit of the form
|
data |
|
family |
specifies a glm family object. Can be a character string (i.e.,
|
ad.alpha |
Alpha value to be used for computing the penalty weights for the
adaptive lasso. Defaults to |
ad.penalty |
Penalty parameter value to be used for computing the penalty
weights for the adaptive lasso. Defaults to |
use.grad |
logical. Should gradient boosting with regression trees be
employed when |
weights |
optional vector of observation weights to be used for deriving the ensemble. |
type |
character. Specifies type of base learners to include in the
ensemble. Defaults to |
sampfrac |
numeric value |
maxdepth |
positive integer. Maximum number of conditions in rules.
If |
learnrate |
numeric value |
mtry |
positive integer. Number of randomly selected predictor variables for
creating each split in each tree. Ignored when |
ntrees |
positive integer value. Number of trees to generate for the initial ensemble. |
confirmatory |
character vector. Specifies one or more confirmatory terms
to be included in the final ensemble. Linear terms can be specified as the
name of a predictor variable included in |
singleconditions |
|
winsfrac |
numeric value |
normalize |
logical. Normalize linear variables before estimating the
regression model? Normalizing gives linear terms the same a priori influence
as a typical rule, by dividing the (winsorized) linear term by 2.5 times its
SD. |
standardize |
logical. Should rules and linear terms be standardized to
have SD equal to 1 before estimating the regression model? This will also
standardize the dummified factors; users are advised to use the default
|
ordinal |
logical. Should ordinal variables (i.e., ordered factors) be
treated as continuous for generating rules? |
nfolds |
positive integer. Number of cross-validation folds to be used for
selecting the optimal value of the penalty parameter |
tree.control |
list with control parameters to be passed to the tree
fitting function, generated using |
tree.unbiased |
logical. Should an unbiased tree generation algorithm
be employed for rule generation? Defaults to |
removecomplements |
logical. Remove rules from the ensemble which are identical to (1 - an earlier rule)? |
removeduplicates |
logical. Remove rules from the ensemble which are identical to an earlier rule? |
verbose |
logical. Should progress be printed to the command line? |
par.init |
logical. Should parallel |
par.final |
logical. Should parallel |
sparse |
logical. Should sparse design matrices be used? May improve computation times for large datasets. |
... |
Further arguments to be passed to
|
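As a rough illustration of how several of the arguments above combine, the sketch below fits a smaller, rules-only ensemble. The specific values (100 trees, depth 2, learning rate 0.05, 5 folds) are arbitrary choices for demonstration, not recommendations, and the object name airq.small is made up for this example.

```r
library("pre")

## Complete cases only, as in the Examples section
airq <- airquality[complete.cases(airquality), ]

set.seed(42)
airq.small <- pre(Ozone ~ ., data = airq,
                  type = "rules",    # rules only, no linear terms
                  maxdepth = 2L,     # at most two conditions per rule
                  ntrees = 100,      # fewer trees than the default of 500
                  learnrate = 0.05,  # illustrative value only
                  nfolds = 5L)       # 5-fold CV for the penalty parameter
airq.small
```

Smaller values of ntrees and nfolds mainly reduce computation time; whether they hurt predictive accuracy depends on the data.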
Details
Note that observations with missing values will be removed prior to analysis (and a warning printed).
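To see in advance how many observations would be dropped, incomplete rows can be counted with base R before calling pre. This is a minimal sketch using the airquality data from the Examples section; the variable name n_incomplete is only illustrative.

```r
## Count rows that contain at least one missing value
n_incomplete <- sum(!complete.cases(airquality))
n_incomplete

## Remove them explicitly, so no rows are dropped silently later on
airq <- airquality[complete.cases(airquality), ]
nrow(airq)
```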
In some cases, duplicated variable names may appear in the model. For example, suppose the first variable is a factor named 'V1' and there are also variables named 'V10' and/or 'V11' and/or 'V12' (etc.). Then for the factor V1, dummy contrast variables will be created, named 'V10', 'V11', 'V12' (etc.). This yields duplicated variable names, which may cause problems later on, for example in the calculation of predictions and importances. This can be prevented by renaming factor variables with numbers in their name, prior to analysis.
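The naming clash can be reproduced with base R's model.matrix, which also pastes factor levels onto the variable name. This only sketches the mechanism: the toy data frame d is made up, and the internal dummy coding used by pre may differ in its details.

```r
set.seed(1)
d <- data.frame(
  V1  = factor(rep(c("0", "1", "2"), times = 4)),  # factor with a digit in its name
  V11 = rnorm(12)                                  # numeric variable named 'V11'
)

## Dummy for level "1" of factor V1 is named 'V11', clashing with the numeric V11
mm <- model.matrix(~ V1 + V11, data = d)
colnames(mm)
anyDuplicated(colnames(mm)) > 0

## Renaming the factor before analysis avoids the clash
names(d)[names(d) == "V1"] <- "group"
mm2 <- model.matrix(~ group + V11, data = d)
colnames(mm2)
```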
The table below provides an overview of the combinations of response variable types, use.grad, tree.unbiased and learnrate settings that are supported, and the tree induction algorithm that will be employed as a result:
use.grad | tree.unbiased | learnrate | family | tree alg. | Response variable format |
TRUE | TRUE | 0 | gaussian | ctree | Single, numeric (non-integer) |
TRUE | TRUE | 0 | mgaussian | ctree | Multiple, numeric (non-integer) |
TRUE | TRUE | 0 | binomial | ctree | Single, factor with 2 levels |
TRUE | TRUE | 0 | multinomial | ctree | Single, factor with >2 levels |
TRUE | TRUE | 0 | poisson | ctree | Single, integer |
TRUE | TRUE | 0 | cox | ctree | Object of class 'Surv' |
TRUE | TRUE | >0 | gaussian | ctree | Single, numeric (non-integer) |
TRUE | TRUE | >0 | mgaussian | ctree | Multiple, numeric (non-integer) |
TRUE | TRUE | >0 | binomial | ctree | Single, factor with 2 levels |
TRUE | TRUE | >0 | multinomial | ctree | Single, factor with >2 levels |
TRUE | TRUE | >0 | poisson | ctree | Single, integer |
TRUE | TRUE | >0 | cox | ctree | Object of class 'Surv' |
FALSE | TRUE | 0 | gaussian | glmtree | Single, numeric (non-integer) |
FALSE | TRUE | 0 | binomial | glmtree | Single, factor with 2 levels |
FALSE | TRUE | 0 | poisson | glmtree | Single, integer |
FALSE | TRUE | >0 | gaussian | glmtree | Single, numeric (non-integer) |
FALSE | TRUE | >0 | binomial | glmtree | Single, factor with 2 levels |
FALSE | TRUE | >0 | poisson | glmtree | Single, integer |
TRUE | FALSE | 0 | gaussian | rpart | Single, numeric (non-integer) |
TRUE | FALSE | 0 | binomial | rpart | Single, factor with 2 levels |
TRUE | FALSE | 0 | multinomial | rpart | Single, factor with >2 levels |
TRUE | FALSE | 0 | poisson | rpart | Single, integer |
TRUE | FALSE | 0 | cox | rpart | Object of class 'Surv' |
TRUE | FALSE | >0 | gaussian | rpart | Single, numeric (non-integer) |
TRUE | FALSE | >0 | binomial | rpart | Single, factor with 2 levels |
TRUE | FALSE | >0 | poisson | rpart | Single, integer |
TRUE | FALSE | >0 | cox | rpart | Object of class 'Surv' |
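The table can be condensed into a small lookup. The helper tree_algorithm and the list supported_families below are hypothetical and not part of package pre; they simply restate the rows of the table.

```r
## Hypothetical helper: which tree algorithm do the settings in the table imply?
tree_algorithm <- function(use.grad = TRUE, tree.unbiased = TRUE) {
  if (use.grad && tree.unbiased) return("ctree")
  if (!use.grad && tree.unbiased) return("glmtree")
  if (use.grad && !tree.unbiased) return("rpart")
  NA_character_  # use.grad = FALSE with tree.unbiased = FALSE is not in the table
}

## Families listed in the table for each algorithm
supported_families <- list(
  ctree   = c("gaussian", "mgaussian", "binomial", "multinomial", "poisson", "cox"),
  glmtree = c("gaussian", "binomial", "poisson"),
  ## for rpart, "multinomial" appears only in the rows with learnrate = 0
  rpart   = c("gaussian", "binomial", "multinomial", "poisson", "cox")
)

tree_algorithm()                       # default settings
tree_algorithm(use.grad = FALSE)
tree_algorithm(tree.unbiased = FALSE)
"cox" %in% supported_families[[tree_algorithm(use.grad = FALSE)]]
```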
If an error along the lines of 'factor ... has new levels ...' is encountered, consult ?rare_level_sampler for explanation and solutions.
Value
An object of class pre. It contains the initial ensemble of rules and/or linear terms and a range of possible final ensembles. By default, the final ensemble employed by all other methods and functions in package pre is selected using the 'minimum cross validated error plus 1 standard error' criterion. All functions and methods for objects of class pre take a penalty.par.val argument, which can be used to select a different criterion.
If only a set of rules needs to be generated, but the final regression model should not be fitted, specify the hidden argument fit.final = FALSE.
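The sketch below shows how a different final ensemble might be selected, and how rule generation can be run without fitting the final model. It assumes the argument is spelled penalty.par.val, as in the signatures of print.pre and coef.pre, and uses ntrees = 100 only to keep the run short. The exact contents of the object returned with fit.final = FALSE are not described here, so the example just inspects its top-level structure.

```r
library("pre")

airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq, ntrees = 100)

## Default criterion: minimum CV error plus 1 standard error
print(airq.ens)

## Select the ensemble at the minimum CV error instead
print(airq.ens, penalty.par.val = "lambda.min")
coef.min <- coef(airq.ens, penalty.par.val = "lambda.min")
head(coef.min[coef.min$coefficient != 0, ])

## Generate rules only, without fitting the final regression model
set.seed(42)
airq.rules <- pre(Ozone ~ ., data = airq, ntrees = 100, fit.final = FALSE)
str(airq.rules, max.level = 1)
```

The "lambda.min" ensemble typically retains more terms than the default, trading sparsity for a lower cross-validated error.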
Note
Parts of the code for deriving rules from the nodes of trees were copied with permission from an internal function of the partykit package, written by Achim Zeileis and Torsten Hothorn.
References
Fokkema, M. (2020). Fitting prediction rule ensembles with R package pre. Journal of Statistical Software, 92(12), 1-30. doi:10.18637/jss.v092.i12
Fokkema, M., & Strobl, C. (2020). Fitting prediction rule ensembles to psychological research data: An introduction and tutorial. Psychological Methods, 25(5), 636-652. doi:10.1037/met0000256, https://arxiv.org/abs/1907.05302
Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. The Annals of Statistics, 29(5), 1189-1232.
Friedman, J. H., & Popescu, B. E. (2008). Predictive learning via rule ensembles. The Annals of Applied Statistics, 2(3), 916-954. doi:10.1214/07-AOAS148
Hothorn, T., & Zeileis, A. (2015). partykit: A modular toolkit for recursive partytioning in R. Journal of Machine Learning Research, 16, 3905-3909.
See Also
print.pre, plot.pre, coef.pre, importance.pre, predict.pre, interact, cvpre
Examples
## Fit pre to a continuous response:
airq <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens <- pre(Ozone ~ ., data = airq)
airq.ens
## Use relaxed lasso to estimate final model
airq.ens.rel <- pre(Ozone ~ ., data = airq, relax = TRUE)
airq.ens.rel
## Use adaptive lasso to estimate final model
airq.ens.ad <- pre(Ozone ~ ., data = airq, ad.alpha = 0)
airq.ens.ad
## Fit pre to a binary response:
airq2 <- airquality[complete.cases(airquality), ]
airq2$Ozone <- factor(airq2$Ozone > median(airq2$Ozone))
set.seed(42)
airq.ens2 <- pre(Ozone ~ ., data = airq2, family = "binomial")
airq.ens2
## Fit pre to a multivariate continuous response:
airq3 <- airquality[complete.cases(airquality), ]
set.seed(42)
airq.ens3 <- pre(Ozone + Wind ~ ., data = airq3, family = "mgaussian")
airq.ens3
## Fit pre to a multinomial response:
set.seed(42)
iris.ens <- pre(Species ~ ., data = iris, family = "multinomial")
iris.ens
## Fit pre to a survival response:
library("survival")
lung <- lung[complete.cases(lung), ]
set.seed(42)
lung.ens <- pre(Surv(time, status) ~ ., data = lung, family = "cox")
lung.ens
## Fit pre to a count response:
## Generate random data (partly based on Dobson (1990) Page 93: Randomized
## Controlled Trial):
counts <- rep(as.integer(c(18, 17, 15, 20, 10, 20, 25, 13, 12)), times = 10)
outcome <- rep(gl(3, 1, 9), times = 10)
treatment <- rep(gl(3, 3), times = 10)
noise1 <- 1:90
set.seed(1)
noise2 <- rnorm(90)
countdata <- data.frame(treatment, outcome, counts, noise1, noise2)
set.seed(42)
count.ens <- pre(counts ~ ., data = countdata, family = "poisson")
count.ens