cv.prune {logicDT}    R Documentation

Optimal pruning via cross-validation

Description

Using a fitted logicDT model, its logic decision tree can be optimally (post-)pruned utilizing k-fold cross-validation.

Usage

cv.prune(
  model,
  nfolds = 10,
  scoring_rule = "deviance",
  choose = "1se",
  simplify = TRUE
)

Arguments

model

A fitted logicDT model

nfolds

Number of cross-validation folds

scoring_rule

The scoring rule for evaluating the cross-validation error and its standard error. For classification tasks, "deviance" or "Brier" should be used.

choose

Model selection scheme. Set choose = "min" to select the model that minimizes the cross-validation error. Otherwise, choose = "1se" selects the simplest model whose cross-validation error lies within one standard error of the minimizing model.

simplify

Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer contained in the tree be removed from the model?

Details

Following Breiman et al. (1984), post-pruning is implemented by first computing the optimal pruning path and then using cross-validation to identify the best-generalizing model.
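
As an illustration, a fitted model can be pruned as follows. This is a minimal sketch: the simulated data, the variable names, and the call to the package's logicDT() fitting function with default settings are assumptions for demonstration purposes.

library(logicDT)

# Simulate a small binary data set (illustrative only)
set.seed(123)
X <- matrix(rbinom(500 * 10, 1, 0.5), nrow = 500)
colnames(X) <- paste0("SNP", 1:10)
y <- rbinom(500, 1, plogis(-1 + 2 * (X[, 1] & !X[, 2])))

# Fit a logicDT model and post-prune its tree via 10-fold CV
model <- logicDT(X, y)
pruned <- cv.prune(model, nfolds = 10, scoring_rule = "deviance",
                   choose = "1se")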

In order to handle continuous covariables with fitted regression models in each leaf, similar to the likelihood-ratio splitting criterion in logicDT, we propose using the log-likelihood as the impurity criterion in this case for computing the pruning path. In particular, for each node $t$, the weighted node impurity $p(t) i(t)$ has to be calculated and the inequality

$$\Delta i(s,t) := i(t) - p(t_L \mid t)\, i(t_L) - p(t_R \mid t)\, i(t_R) \geq 0$$

has to be fulfilled for each possible split $s$ splitting $t$ into two subnodes $t_L$ and $t_R$. Here, $i(t)$ describes the impurity of a node $t$, $p(t)$ the proportion of data points falling into $t$, and $p(t' \mid t)$ the proportion of data points falling from $t$ into $t'$. Since the regression models are fitted using maximum likelihood, the maximum likelihood criterion fulfills this property and can also be seen as an extension of the entropy impurity criterion in the case of classification or an extension of the MSE impurity criterion in the case of regression.
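
The following sketch illustrates the weighted impurity decrease for a single split, using the entropy impurity (the classification special case mentioned above) for a binary outcome. It is a self-contained toy computation, not the package's internal implementation:

# Entropy impurity of a node with binary outcome y (toy example)
entropy <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -p * log(p) - (1 - p) * log(1 - p)
}

# Impurity decrease Delta i(s, t) for splitting node t by the
# logical vector 'left' (TRUE = observation falls into t_L)
impurity_decrease <- function(y_t, left) {
  p_L <- mean(left)  # p(t_L | t)
  entropy(y_t) - p_L * entropy(y_t[left]) - (1 - p_L) * entropy(y_t[!left])
}

y_t <- c(0, 0, 1, 1, 1, 0, 1, 1)
x   <- c(0, 0, 0, 0, 1, 1, 1, 1)
impurity_decrease(y_t, left = (x == 0))  # >= 0, as required above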

The default model selection is done by choosing the most parsimonious model that yields a cross-validation error in the range of $\mathrm{CV}_{\min} + \mathrm{SE}_{\min}$ for the minimal cross-validation error $\mathrm{CV}_{\min}$ and its corresponding standard error $\mathrm{SE}_{\min}$. For a more robust standard error estimation, the scores are calculated per training observation such that the AUC is no longer an appropriate choice and the deviance or the Brier score should be used in the case of classification.
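
A sketch of this selection rule, assuming hypothetical vectors of per-model cross-validation scores and standard errors ordered from the most complex to the most parsimonious model:

# Index of the simplest model within one SE of the CV minimum
select_1se <- function(scores, ses) {
  i_min <- which.min(scores)
  threshold <- scores[i_min] + ses[i_min]
  max(which(scores <= threshold))  # models are ordered complex -> simple
}

scores <- c(0.60, 0.52, 0.50, 0.53, 0.58)
ses    <- c(0.04, 0.03, 0.03, 0.03, 0.04)
select_1se(scores, ses)  # 4: simplest model with score <= 0.50 + 0.03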

Value

A list containing

model

The new logicDT model containing the optimally pruned tree

cv.res

A data frame containing the penalties, the cross-validation scores and the corresponding standard errors

best.beta

The ideal penalty value
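
Continuing the sketch from the Details section, the returned components can be accessed as follows:

pruned <- cv.prune(model)  # defaults: nfolds = 10, choose = "1se"
pruned$model               # the optimally pruned logicDT model
head(pruned$cv.res)        # penalties, CV scores, and standard errors
pruned$best.beta           # the selected penalty value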

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

[Package logicDT version 1.0.4 Index]