PrInDTreg {PrInDT}    R Documentation

Regression tree resampling by the PrInDT method

Description

Regression tree optimization to identify the best interpretable tree; interpretability is checked (see 'ctestv').
The relationship between the target variable 'regname' and all other factor and numerical variables in the data frame 'datain' is optimally modeled by means of 'N' repetitions of subsampling.
The optimization criterion is the R2 of the model on the full sample.
Multiple subsampling percentages of observations and predictors can be specified (in 'pobs' and 'ppre', correspondingly).
The trees generated from subsampling can be restricted by rejecting unacceptable trees that include split results specified in the character strings of the vector 'ctestv'.

Usage

PrInDTreg(datain, regname, ctestv=NA, N, pobs, ppre, conf.level=0.95)

Arguments

datain

Input data frame with the regressand variable 'regname' and the
influential variables, which need to be factors or numericals (transform logicals and character variables to factors)

regname

Name of the regressand variable (character)

ctestv

Vector of character strings of forbidden split results;
see function PrInDT for details.
If no restrictions exist, the default = NA is used.
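
The idea behind 'ctestv' can be illustrated with a minimal sketch. It assumes that a tree counts as unacceptable as soon as one of the forbidden strings occurs literally in a textual description of its splits; the helper 'is_interpretable', the variable 'speed' and the split descriptions are hypothetical and not part of the package.

```r
# Hypothetical helper (not the package's internal code): TRUE if none of the
# forbidden strings in 'ctestv' occurs in the textual split descriptions.
is_interpretable <- function(tree_text, ctestv = NA) {
  if (all(is.na(ctestv))) return(TRUE)  # default NA: no restrictions
  hits <- vapply(ctestv,
                 function(s) any(grepl(s, tree_text, fixed = TRUE)),
                 logical(1))
  !any(hits)
}

# Made-up split descriptions standing in for two printed trees
tree_a <- c("[2] vowel_maximum_pitch <= 320", "[3] vowel_maximum_pitch > 320")
tree_b <- c("[2] speed <= 4.1", "[3] speed > 4.1")

ctestv <- "vowel_maximum_pitch <= 320"
res_a <- is_interpretable(tree_a, ctestv)  # contains the forbidden split
res_b <- is_interpretable(tree_b, ctestv)  # no forbidden split
res_c <- is_interpretable(tree_a)          # no restrictions specified
```

Under these assumptions only 'tree_a' checked against 'ctestv' would be rejected.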

N

Number of repetitions (integer > 0)

pobs

Vector of resampling percentages of observations (numerical, > 0 and <= 1)

ppre

Vector of resampling percentages of predictor variables (numerical, > 0 and <= 1)
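
As a rough illustration of how 'pobs' and 'ppre' interact, the following sketch lists all combinations of the two vectors together with the resulting subsample sizes. The data dimensions 'n_obs' and 'n_pred' are made up, and rounding to the nearest integer is an assumption, not necessarily what the package does.

```r
pobs <- c(0.70, 0.60)  # percentages of observations
ppre <- c(0.90, 0.70)  # percentages of predictors

n_obs  <- 200  # assumed number of rows in 'datain'
n_pred <- 10   # assumed number of predictor columns in 'datain'

# One row per combination of an entry of 'pobs' with an entry of 'ppre'
grid <- expand.grid(pobs = pobs, ppre = ppre)
grid$obs_drawn  <- round(grid$pobs * n_obs)   # rounding rule is an assumption
grid$pred_drawn <- round(grid$ppre * n_pred)
grid
```

With two entries in each vector there are four combinations, each of which would be used for 'N' repetitions.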

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

Details

For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.

Reference
Tibshirani, R., Knight, K. (1999). Model Search and Inference by Bootstrap "Bumping". Journal of Computational and Graphical Statistics, 8(4), 671-686.
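
The Sumping idea described above can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: 'lm' stands in for the conditional inference tree, the interpretability check is omitted, R2 is taken as 1 - SSE/SST, and the names 'r2' and 'sumping_sketch' as well as the use of the 'mtcars' data are hypothetical.

```r
# R2 of predictions 'pred' for observed values 'y' (assumed definition)
r2 <- function(y, pred) 1 - sum((y - pred)^2) / sum((y - mean(y))^2)

# Hypothetical sketch: lm() replaces the conditional inference tree
sumping_sketch <- function(datain, regname, N, pobs, ppre) {
  predictors <- setdiff(names(datain), regname)
  best <- list(R2 = -Inf)
  for (po in pobs) {
    for (pp in ppre) {
      for (i in seq_len(N)) {
        # subsample observations and predictors (rounding is an assumption)
        rows <- sample(nrow(datain), round(po * nrow(datain)))
        vars <- sample(predictors, max(1, round(pp * length(predictors))))
        fit  <- lm(reformulate(vars, response = regname), data = datain[rows, ])
        # the optimization criterion is evaluated on the full sample
        R2 <- r2(datain[[regname]], predict(fit, newdata = datain))
        if (R2 > best$R2) best <- list(R2 = R2, model = fit, pobs = po, ppre = pp)
      }
    }
  }
  best
}

set.seed(1)
best <- sumping_sketch(mtcars, "mpg", N = 30, pobs = c(0.7, 0.6), ppre = c(0.9, 0.7))
best$R2    # best R2 on the full data set
best$pobs  # percentage of observations that produced it
```

The model kept at the end is the one fitted on a subsample that predicts the full sample best, which is the role 'ctmax' plays in the output of PrInDTreg.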

Standard output can be produced by means of print(name) or just name as well as plot(name), where 'name' is the object returned by the function.
The plot function produces a series of plots. If you use the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to keep the whole series of plots. In RStudio this functionality is provided automatically.
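
A platform-independent alternative, not mentioned in the original documentation, is to send the series of plots to a multi-page PDF file. The sketch below uses placeholder plots in place of plot(name); the file name is hypothetical.

```r
out_file <- file.path(tempdir(), "plot_series.pdf")  # hypothetical file name
pdf(out_file)                            # open a multi-page PDF device
plot(1:10, main = "Plot 1")              # placeholder for the first plot of plot(name)
hist(c(2, 3, 3, 5, 8), main = "Plot 2")  # placeholder for the second plot
dev.off()                                # close the device and write the file
file.exists(out_file)
```

Each new plot starts a new page, so the whole series is kept in one file.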

Value

meanint

Mean number of interpretable trees over the combinations of individual percentages in 'pobs' and 'ppre'

R2mean

Mean R2 on test sets

ctmax

Best resampled regression tree according to R2 on the full data set

percmax

Percentage of observations for which the maximum R2 was achieved

perfeamax

Percentage of predictors for which the maximum R2 was achieved

maxR2

Best R2 on the full data set for resampled regression trees (for 'ctmax')

interpmax

Interpretability of the best tree 'ctmax'

ctmax2

Second best resampled regression tree according to R2 on the full data set

percmax2

Percentage of observations for which the second best R2 was achieved

perfeamax2

Percentage of predictors for which the second best R2 was achieved

max2R2

Second best R2 on the full data set for resampled regression trees (for 'ctmax2')

interp2max

Interpretability of the second best tree 'ctmax2'

Examples

data <- PrInDT::data_vowel
data <- na.omit(data) # remove rows with missing values
ctestv <- 'vowel_maximum_pitch <= 320' # forbidden split result
N <- 30 # no. of repetitions
pobs <- c(0.70,0.60) # percentages of observations
ppre <- c(0.90,0.70) # percentages of predictors
outreg <- PrInDTreg(data,"target",ctestv,N,pobs,ppre)
outreg # print the standard output
plot(outreg) # produces a series of plots


[Package PrInDT version 1.0.1 Index]