cv.prune {logicDT}    R Documentation

Optimal pruning via cross-validation

Description

Using a fitted logicDT model, its logic decision tree can be optimally (post-)pruned utilizing k-fold cross-validation.

Usage

cv.prune(
  model,
  nfolds = 10,
  scoring_rule = "deviance",
  choose = "1se",
  simplify = TRUE
)

Arguments

model

A fitted logicDT model

nfolds

Number of cross-validation folds

scoring_rule

The scoring rule for evaluating the cross-validation error and its standard error. For classification tasks, "deviance" or "Brier" should be used.

choose

Model selection scheme. Set choose = "min" to select the model that minimizes the cross-validation error. Otherwise, choose = "1se" selects the simplest model whose cross-validation error lies within one standard error of the minimizing model.

simplify

Should the pruned model be simplified with regard to the input terms, i.e., should terms that are no longer contained in the tree be removed from the model?

Details

Following Breiman et al. (1984), post-pruning is implemented by first computing the optimal pruning path and then using cross-validation to identify the best-generalizing model.
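
As an illustration, a fitted model can be pruned as follows. This is a minimal sketch: the simulated data, the variable names, and the call to the package's logicDT() fitting function with default settings are assumptions for demonstration purposes.

library(logicDT)

# Simulate a small binary data set (illustrative only)
set.seed(123)
X <- matrix(rbinom(500 * 10, 1, 0.5), nrow = 500)
colnames(X) <- paste0("SNP", 1:10)
y <- rbinom(500, 1, plogis(-1 + 2 * (X[, 1] & !X[, 2])))

# Fit a logicDT model and post-prune its tree via 10-fold CV
model <- logicDT(X, y)
pruned <- cv.prune(model, nfolds = 10, scoring_rule = "deviance",
                   choose = "1se")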

In order to handle continuous covariables with fitted regression models in each leaf, similar to the likelihood-ratio splitting criterion in logicDT, we propose using the log-likelihood as the impurity criterion in this case for computing the pruning path. In particular, for each node $t$, the weighted node impurity $p(t) i(t)$ has to be calculated and the inequality

$$\Delta i(s,t) := i(t) - p(t_L \mid t)\, i(t_L) - p(t_R \mid t)\, i(t_R) \geq 0$$

has to be fulfilled for each possible split $s$ splitting $t$ into two subnodes $t_L$ and $t_R$. Here, $i(t)$ describes the impurity of a node $t$, $p(t)$ the proportion of data points falling into $t$, and $p(t' \mid t)$ the proportion of data points falling from $t$ into $t'$. Since the regression models are fitted using maximum likelihood, the maximum likelihood criterion fulfills this property and can also be seen as an extension of the entropy impurity criterion in the case of classification or an extension of the MSE impurity criterion in the case of regression.
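
The following sketch illustrates the weighted impurity decrease for a single split, using the entropy impurity (the classification special case mentioned above) for a binary outcome. It is a self-contained toy computation, not the package's internal implementation:

# Entropy impurity of a node with binary outcome y (toy example)
entropy <- function(y) {
  p <- mean(y)
  if (p == 0 || p == 1) return(0)
  -p * log(p) - (1 - p) * log(1 - p)
}

# Impurity decrease Delta i(s, t) for splitting node t by the
# logical vector 'left' (TRUE = observation falls into t_L)
impurity_decrease <- function(y_t, left) {
  p_L <- mean(left)  # p(t_L | t)
  entropy(y_t) - p_L * entropy(y_t[left]) - (1 - p_L) * entropy(y_t[!left])
}

y_t <- c(0, 0, 1, 1, 1, 0, 1, 1)
x   <- c(0, 0, 0, 0, 1, 1, 1, 1)
impurity_decrease(y_t, left = (x == 0))  # >= 0, as required above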

The default model selection is done by choosing the most parsimonious model that yields a cross-validation error in the range of $\mathrm{CV}_{\min} + \mathrm{SE}_{\min}$ for the minimal cross-validation error $\mathrm{CV}_{\min}$ and its corresponding standard error $\mathrm{SE}_{\min}$. For a more robust standard error estimation, the scores are calculated per training observation such that the AUC is no longer an appropriate choice and the deviance or the Brier score should be used in the case of classification.
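
A sketch of this selection rule, assuming hypothetical vectors of per-model cross-validation scores and standard errors ordered from the most complex to the most parsimonious model:

# Index of the simplest model within one SE of the CV minimum
select_1se <- function(scores, ses) {
  i_min <- which.min(scores)
  threshold <- scores[i_min] + ses[i_min]
  max(which(scores <= threshold))  # models are ordered complex -> simple
}

scores <- c(0.60, 0.52, 0.50, 0.53, 0.58)
ses    <- c(0.04, 0.03, 0.03, 0.03, 0.04)
select_1se(scores, ses)  # 4: simplest model with score <= 0.50 + 0.03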

Value

A list containing

model

The new logicDT model containing the optimally pruned tree

cv.res

A data frame containing the penalties, the cross-validation scores and the corresponding standard errors

best.beta

The ideal penalty value
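
Continuing the sketch from the Details section, the returned components can be accessed as follows:

pruned <- cv.prune(model)  # defaults: nfolds = 10, choose = "1se"
pruned$model               # the optimally pruned logicDT model
head(pruned$cv.res)        # penalties, CV scores, and standard errors
pruned$best.beta           # the selected penalty value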

References

Breiman, L., Friedman, J. H., Olshen, R. A., & Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.

[Package logicDT version 1.0.4 Index]