cv.prune {logicDT}    R Documentation
Optimal pruning via cross-validation
Description
Using a fitted logicDT model, its logic decision tree can be optimally (post-)pruned utilizing k-fold cross-validation.
Usage
cv.prune(
  model,
  nfolds = 10,
  scoring_rule = "deviance",
  choose = "1se",
  simplify = TRUE
)
Arguments
model
    A fitted logicDT model.

nfolds
    Number of cross-validation folds.

scoring_rule
    The scoring rule for evaluating the cross-validation error and its standard error. For classification tasks, the deviance (default) or the Brier score should be used (see Details).

choose
    Model selection scheme. The default "1se" chooses the most parsimonious model whose cross-validation error is at most one standard error above the minimum; alternatively, the model that minimizes the cross-validation error itself can be chosen (see Details).

simplify
    Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer contained in the tree be removed from the model?
Details
Similar to Breiman et al. (1984), we implement post-pruning by first computing the optimal pruning path and then using cross-validation to identify the best generalizing model.
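For illustration, the following plain-R sketch computes such a weakest-link (cost-complexity) pruning path on a generic binary tree; the node risks play the role of the weighted impurities p(t)i(t) defined below. The representation and function names are illustrative only and do not correspond to logicDT's internal implementation.

leaf <- function(risk) list(risk = risk, left = NULL, right = NULL)
node <- function(risk, left, right) list(risk = risk, left = left, right = right)

# Risk R(T_t) and leaf count |T_t| of the subtree rooted at t
subtree_stats <- function(t) {
  if (is.null(t$left)) return(list(risk = t$risk, leaves = 1))
  l <- subtree_stats(t$left)
  r <- subtree_stats(t$right)
  list(risk = l$risk + r$risk, leaves = l$leaves + r$leaves)
}

# g(t) = (R(t) - R(T_t)) / (|T_t| - 1): the penalty at which collapsing t pays off
g_value <- function(t) {
  s <- subtree_stats(t)
  (t$risk - s$risk) / (s$leaves - 1)
}

# Smallest g(t) over all internal nodes of the current tree
min_g <- function(t) {
  if (is.null(t$left)) return(Inf)
  min(g_value(t), min_g(t$left), min_g(t$right))
}

# Collapse every internal node attaining g(t) = gmin into a leaf
collapse <- function(t, gmin) {
  if (is.null(t$left)) return(t)
  if (isTRUE(all.equal(g_value(t), gmin))) return(leaf(t$risk))
  node(t$risk, collapse(t$left, gmin), collapse(t$right, gmin))
}

# Pruning path: sequence of (penalty, tree) pairs from the full tree down to the root
prune_path <- function(t) {
  path <- list(list(penalty = 0, tree = t))
  while (!is.null(t$left)) {
    gmin <- min_g(t)
    t <- collapse(t, gmin)
    path[[length(path) + 1]] <- list(penalty = gmin, tree = t)
  }
  path
}

# Toy tree with made-up weighted node impurities
tr <- node(0.40, node(0.20, leaf(0.09), leaf(0.09)), leaf(0.15))
sapply(prune_path(tr), function(p) p$penalty)  # non-decreasing penalty sequence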
In order to handle continuous covariables with fitted regression models in each leaf, similar to the likelihood-ratio splitting criterion in logicDT, we propose using the log-likelihood as the impurity criterion in this case for computing the pruning path. In particular, for each node t, the weighted node impurity p(t)i(t) has to be calculated and the inequality

\Delta i(s,t) := i(t) - p(t_L | t)i(t_L) - p(t_R | t)i(t_R) \geq 0

has to be fulfilled for each possible split s splitting t into two subnodes t_L and t_R. Here, i(t) describes the impurity of a node t, p(t) the proportion of data points falling into t, and p(t' | t) the proportion of data points falling from t into t'.
Since the regression models are fitted using maximum likelihood, the
maximum likelihood criterion fulfills this property and can also be seen as
an extension of the entropy impurity criterion in the case of classification
or an extension of the MSE impurity criterion in the case of regression.
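As a small numeric illustration of this property (not part of the package), one can take the per-observation negative log-likelihood of a Gaussian regression model fitted in each node as i(t) and verify the inequality for a simulated split; all variable names below are made up.

set.seed(1)
n <- 200
z <- rnorm(n)                              # continuous covariable modelled in each node
x <- rbinom(n, 1, 0.5)                     # binary variable defining the split s
y <- 1 + 2 * z + 0.5 * x * z + rnorm(n)    # response

# Maximized Gaussian log-likelihood of a regression model fitted on a node
loglik <- function(idx) as.numeric(logLik(lm(y[idx] ~ z[idx])))

node_t  <- seq_len(n)        # parent node t (here: all observations)
node_tl <- which(x == 0)     # left subnode t_L
node_tr <- which(x == 1)     # right subnode t_R

# i(t) = -logLik(t) / n_t and p(t' | t) = n_{t'} / n_t
i_t  <- -loglik(node_t)  / length(node_t)
i_tl <- -loglik(node_tl) / length(node_tl)
i_tr <- -loglik(node_tr) / length(node_tr)
p_tl <- length(node_tl) / length(node_t)
p_tr <- length(node_tr) / length(node_t)

delta_i <- i_t - p_tl * i_tl - p_tr * i_tr
delta_i >= 0   # TRUE: refitting in the subnodes can only increase the likelihood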
The default model selection is done by choosing the most parsimonious model that yields a cross-validation error of at most \mathrm{CV}_{\min} + \mathrm{SE}_{\min}, where \mathrm{CV}_{\min} denotes the minimal cross-validation error and \mathrm{SE}_{\min} its corresponding standard error.
For a more robust standard error estimation, the scores are calculated per
training observation such that the AUC is no longer an appropriate choice
and the deviance or the Brier score should be used in the case of
classification.
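In terms of the returned cv.res table, the default selection corresponds to the following sketch; the column names penalty, score and se are assumptions for illustration and need not match the actual output.

cv.res <- data.frame(
  penalty = c(0.000, 0.005, 0.010, 0.020, 0.050),
  score   = c(0.412, 0.405, 0.401, 0.404, 0.430),
  se      = c(0.012, 0.011, 0.010, 0.010, 0.013)
)
cv_min <- min(cv.res$score)
se_min <- cv.res$se[which.min(cv.res$score)]
# Most parsimonious model (largest penalty) with a score of at most CV_min + SE_min
best_beta <- max(cv.res$penalty[cv.res$score <= cv_min + se_min])
best_beta   # 0.02 in this made-up example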
Value
A list containing
model
    The new logicDT model containing the optimally pruned decision tree.

cv.res
    A data frame containing the penalties, the cross-validation scores and the corresponding standard errors.

best.beta
    The ideal penalty value.
References
Breiman, L., Friedman, J., Stone, C. J. & Olshen, R. A. (1984). Classification and Regression Trees. CRC Press. doi: 10.1201/9781315139470
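Examples

A minimal usage sketch; the logicDT() fitting call and its default arguments are assumed to follow the package's standard interface and are not prescribed by this page.

library(logicDT)

set.seed(123)
X <- matrix(rbinom(250 * 10, 1, 0.5), ncol = 10)          # binary predictors
y <- rbinom(250, 1, plogis(-1 + 2 * (X[, 1] & X[, 2])))   # binary response

fit <- logicDT(X, y)                       # fit the full model (defaults assumed)
pruned <- cv.prune(fit, nfolds = 10,
                   scoring_rule = "deviance", choose = "1se")

pruned$cv.res      # penalties, cross-validation scores and standard errors
pruned$best.beta   # chosen penalty
pruned$model       # the pruned logicDT model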