R: Regression tree resampling by the PrInDT method

PrInDTreg {PrInDT}

R Documentation

Regression tree resampling by the PrInDT method

Description

Regression tree optimization to identify the best interpretable tree; interpretability is checked (see 'ctestv').
The relationship between the target variable 'regname' and all other factor and numerical variables in the data frame 'datain' is optimally modeled by means of 'N' repetitions of subsampling.
The optimization criterion is the R2 of the model on the full sample.
Multiple subsampling percentages of observations and predictors can be specified (in 'pobs' and 'ppre', correspondingly).
The trees generated from subsampling can be restricted by rejecting unacceptable trees that include split results specified in the character strings of the vector 'ctestv'.
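The rejection rule for 'ctestv' can be pictured as a plain string match against the split rules of a tree. The following is only a sketch of that idea, not the package's internal code: the helper `is_interpretable` and the rule strings (including the variable `x1`) are hypothetical, and it assumes the split rules are already available as character strings.

```r
# Hypothetical helper: TRUE if none of the forbidden strings in 'ctestv'
# occurs in the split rules of a tree (given as character strings).
is_interpretable <- function(split_rules, ctestv) {
  # ctestv = NA means that no restrictions exist
  if (length(ctestv) == 1 && is.na(ctestv)) return(TRUE)
  hits <- vapply(
    ctestv,
    function(s) any(grepl(s, split_rules, fixed = TRUE)),  # literal match, no regex
    logical(1)
  )
  !any(hits)
}

ctestv    <- "vowel_maximum_pitch <= 320"
rules_ok  <- c("vowel_maximum_pitch <= 250", "x1 > 40")   # made-up split rules
rules_bad <- c("vowel_maximum_pitch <= 320", "x1 > 40")

ok_result  <- is_interpretable(rules_ok, ctestv)    # no forbidden split present
bad_result <- is_interpretable(rules_bad, ctestv)   # forbidden split present
na_result  <- is_interpretable(rules_bad, NA)       # no restrictions at all
```

Here `ok_result` and `na_result` are TRUE and `bad_result` is FALSE. Literal matching (`fixed = TRUE`) is chosen so that characters such as `<` or `.` in a split rule are not interpreted as regular expression syntax.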

Usage

PrInDTreg(datain, regname, ctestv=NA, N, pobs, ppre, conf.level=0.95)

Arguments

`datain`	Input data frame with the regressand variable 'regname' and the influential variables, which need to be factors or numericals (transform logicals and character variables to factors)
`regname`	Name of the regressand variable (character)
`ctestv`	Vector of character strings of forbidden split results; see function `PrInDT` for details. If no restrictions exist, the default = NA is used.
`N`	Number of repetitions (integer > 0)
`pobs`	Vector of resampling percentages of observations (numerical, > 0 and <= 1)
`ppre`	Vector of resampling percentages of predictor variables (numerical, > 0 and <= 1)
`conf.level`	(1 - significance level) in function `ctree` (numerical, > 0 and <= 1); default = 0.95
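Since 'datain' may only contain factors and numericals, logical and character columns have to be converted beforehand. A minimal base R sketch of that preparation step follows; the toy data frame and its column names are made up for illustration.

```r
# Toy data frame (hypothetical columns) with a logical and a character column
datain <- data.frame(
  target = c(1.2, 3.4, 2.2, 5.1),
  flag   = c(TRUE, FALSE, TRUE, TRUE),
  group  = c("a", "b", "a", "c"),
  x      = c(10, 20, 15, 30),
  stringsAsFactors = FALSE
)

# Find every logical or character column ...
to_factor <- vapply(
  datain,
  function(col) is.logical(col) || is.character(col),
  logical(1)
)
# ... and convert those columns to factors; numeric columns stay untouched
datain[to_factor] <- lapply(datain[to_factor], factor)

str(datain)
```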

Details

For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) that uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.

Reference
Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping". Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
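The subsample-and-select idea behind Sumping can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the implementation in PrInDT: `lm` stands in for the conditional inference tree so that the sketch runs without extra packages, the built-in `mtcars` data replace 'datain', the helper `r2_full` is hypothetical, and the interpretability check is left out.

```r
set.seed(1)

# Built-in data as a stand-in for 'datain'; all columns are numeric
datain  <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
regname <- "mpg"
N    <- 20            # repetitions per combination of percentages
pobs <- c(0.7, 0.6)   # fractions of observations
ppre <- c(0.9, 0.7)   # fractions of predictors

# Hypothetical helper: R2 of a fitted model evaluated on the full sample
r2_full <- function(model, data, regname) {
  y    <- data[[regname]]
  yhat <- predict(model, newdata = data)
  1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
}

predictors <- setdiff(names(datain), regname)
best <- list(R2 = -Inf)

for (po in pobs) {
  for (pp in ppre) {
    for (i in seq_len(N)) {
      # Subsampling: draw rows and predictors WITHOUT replacement
      rows <- sample(nrow(datain), size = round(po * nrow(datain)))
      vars <- sample(predictors, size = max(1, round(pp * length(predictors))))
      # Fit on the subsample (lm is a stand-in for a conditional inference tree)
      fit <- lm(reformulate(vars, response = regname), data = datain[rows, ])
      # Evaluate on the full sample and keep the best model found so far
      r2 <- r2_full(fit, datain, regname)
      if (r2 > best$R2) {
        best <- list(R2 = r2, model = fit, pobs = po, ppre = pp)
      }
    }
  }
}

best$R2     # best R2 on the full sample
best$pobs   # fraction of observations that produced it
best$ppre   # fraction of predictors that produced it
```

The only difference from Bumping in this sketch is the call to `sample` without replacement on a fraction of the rows; a bootstrap version would draw `nrow(datain)` rows with `replace = TRUE`.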

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.
The plot function produces a series of several plots. If you use plain R on Windows, you might want to call windows(record=TRUE) before plot(name) to keep the whole series of plots. In RStudio, the plot history is kept automatically.
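A platform-independent way to keep a whole series of plots is to write them to a multi-page PDF. The sketch below uses only base R graphics devices; the helper `save_plot_series` is hypothetical, and an `lm` fit (whose plot method also draws several plots) stands in for the PrInDTreg result. That plot(name) for a PrInDTreg result behaves the same way on a PDF device is an assumption, not something stated in this documentation.

```r
# Hypothetical helper: send every plot produced by plot(obj) to one PDF file
save_plot_series <- function(obj, file) {
  pdf(file)            # open a PDF device; each new plot starts a new page
  on.exit(dev.off())   # close the device even if plotting fails
  plot(obj)
  invisible(file)
}

# Stand-in object with a multi-plot plot() method;
# with PrInDT one would pass the result of PrInDTreg instead
fit <- lm(mpg ~ wt, data = mtcars)
out_file <- save_plot_series(fit, file.path(tempdir(), "plot_series.pdf"))
file.exists(out_file)
```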

Value

meanint: Mean number of interpretable trees over the combinations of individual percentages in 'pobs' and 'ppre'
R2mean: Mean R2 on test sets
ctmax: Best resampled regression tree according to R2 on the full data set
percmax: Maximum R2 achieved for %observations
perfeamax: Maximum R2 achieved for %predictors
maxR2: Best R2 on the full data set for resampled regression trees (for 'ctmax')
interpmax: Interpretability of best tree 'ctmax'
ctmax2: Second-best resampled regression tree according to R2 on the full data set
percmax2: Second-best R2 achieved for %observations
perfeamax2: Second-best R2 achieved for %predictors
max2R2: Second-best R2 on the full data set for resampled regression trees (for 'ctmax2')
interp2max: Interpretability of second-best tree 'ctmax2'

Examples

data <- PrInDT::data_vowel
data <- na.omit(data)
ctestv <- 'vowel_maximum_pitch <= 320'
N <- 30 # no. of repetitions
pobs <- c(0.70,0.60) # percentages of observations
ppre <- c(0.90,0.70) # percentages of predictors
outreg <- PrInDTreg(data,"target",ctestv,N,pobs,ppre)
outreg
plot(outreg)

[Package PrInDT version 1.0.1 Index]