PrInDTreg {PrInDT} | R Documentation |
Regression tree resampling by the PrInDT method
Description
Regression tree optimization to identify the best interpretable tree; interpretability is checked (see 'ctestv').
The relationship between the target variable 'regname' and all other factor and numerical variables
in the data frame 'datain' is optimally modeled by means of 'N' repetitions of subsampling.
The optimization criterion is the R2 of the model on the full sample.
Multiple subsampling percentages of observations and predictors can be specified (in 'pobs' and 'ppre', correspondingly).
The trees generated from subsampling can be restricted by
rejecting unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
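The rejection rule described above can be illustrated with a minimal sketch. It assumes that a tree's splits are available as a character vector of rule strings and that a tree counts as unacceptable when any string in 'ctestv' occurs literally in one of them. The helper name is_interpretable and the rule vectors are hypothetical; the package's internal check may work differently.

```r
# Hypothetical helper (not part of PrInDT): returns TRUE when none of the
# forbidden strings in 'ctestv' occurs in any of the tree's split rules.
is_interpretable <- function(split_rules, ctestv) {
  # ctestv = NA (the default in the usage) means "no restrictions"
  if (length(ctestv) == 0 || all(is.na(ctestv))) return(TRUE)
  ctestv <- ctestv[!is.na(ctestv)]
  # fixed = TRUE: match the strings literally, not as regular expressions
  hits <- vapply(ctestv,
                 function(s) any(grepl(s, split_rules, fixed = TRUE)),
                 logical(1))
  !any(hits)
}

# Made-up split rules of two candidate trees
rules_a <- c("vowel_maximum_pitch <= 320", "x2 > 40")
rules_b <- c("x1 <= 0.12", "x2 > 40")
ctestv <- "vowel_maximum_pitch <= 320"

is_interpretable(rules_a, ctestv)  # FALSE: contains a forbidden split
is_interpretable(rules_b, ctestv)  # TRUE: no forbidden split
```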
Usage
PrInDTreg(datain, regname, ctestv=NA, N, pobs, ppre, conf.level=0.95)
Arguments
datain |
Input data frame with the regressand variable 'regname' and the predictor variables (factors or numerical variables) |
regname |
name of regressand variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
pobs |
Vector of resampling percentages of observations (numerical, > 0 and <= 1) |
ppre |
Vector of resampling percentages of predictor variables (numerical, > 0 and <= 1) |
conf.level |
(1 - significance level) in function |
Details
For the optimization of the trees, we employ a method we call Sumping (Subsampling umbrella of model parameters), a variant of Bumping (Bootstrap umbrella of model parameters) (Tibshirani & Knight, 1999) which uses subsampling instead of bootstrapping. The aim of the optimization is to identify conditional inference trees with maximum predictive power on the full sample under interpretability restrictions.
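The general idea of such a subsampling search can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the package's implementation: lm() stands in for a conditional inference tree so that the block needs no extra packages, the built-in mtcars data replaces 'datain', the interpretability check is left out, and the names r2 and sumping_sketch are hypothetical.

```r
set.seed(1)

# R2 of predictions 'yhat' against observed values 'y'
r2 <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)

# Hypothetical sketch: for every combination of percentages, draw N subsamples
# of observations and predictors, fit a model on each subsample, score it on
# the FULL sample, and keep the best-scoring model.
sumping_sketch <- function(data, target, N, pobs, ppre) {
  predictors <- setdiff(names(data), target)
  best <- list(r2 = -Inf, model = NULL, pobs = NA, ppre = NA)
  for (po in pobs) for (pp in ppre) {
    for (i in seq_len(N)) {
      rows <- sample(nrow(data), size = max(2, floor(po * nrow(data))))
      vars <- sample(predictors, size = max(1, floor(pp * length(predictors))))
      # lm() is a stand-in for the tree model used by the package
      fit <- lm(reformulate(vars, response = target),
                data = data[rows, , drop = FALSE])
      score <- r2(data[[target]], suppressWarnings(predict(fit, newdata = data)))
      if (is.finite(score) && score > best$r2) {
        best <- list(r2 = score, model = fit, pobs = po, ppre = pp)
      }
    }
  }
  best
}

best <- sumping_sketch(mtcars, "mpg", N = 20,
                       pobs = c(0.7, 0.6), ppre = c(0.9, 0.7))
best$r2    # best R2 on the full sample
best$pobs  # percentage of observations that produced it
best$ppre  # percentage of predictors that produced it
```

Scoring every candidate on the full sample rather than on its own subsample is what distinguishes this kind of search from ordinary resampled fitting: the subsamples only generate candidate models, and the full data decide between them.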
Reference
Tibshirani, R., Knight, K. 1999. Model Search and Inference By Bootstrap "bumping".
Journal of Computational and Graphical Statistics, Vol. 8, No. 4 (Dec., 1999), pp. 671-686
Standard output can be produced by means of print(name) or just name as well as plot(name), where 'name' is the output data frame of the function.
The plot function produces a series of several plots. If you use R, you might want to specify windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio this functionality is provided automatically.
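Since windows() is only available on Windows, a portable way to keep a whole series of plots is to send them to a multi-page PDF device. The sketch below uses two generic plots in place of plot(name); it is an assumption that the plot method for the function's output draws on the currently active graphics device in the usual way. The file name is arbitrary.

```r
# Write all plots produced between pdf() and dev.off() into one PDF file;
# every new plot starts a new page.
out_file <- file.path(tempdir(), "plots.pdf")
pdf(out_file)
plot(1:10, main = "first plot")   # stand-in for the first plot of the series
plot(10:1, main = "second plot")  # stand-in for the second plot of the series
dev.off()                         # close the device so the file is complete

file.exists(out_file)
```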
Value
- meanint
Mean number of interpretable trees over the combinations of individual percentages in 'pobs' and 'ppre'
- R2mean
Mean R2 on test sets
- ctmax
best resampled regression tree according to R2 on the full data set
- percmax
Maximum R2 achieved for %observations
- perfeamax
Maximum R2 achieved for %predictors
- maxR2
best R2 on the full data set for resampled regression trees (for 'ctmax')
- interpmax
interpretability of best tree 'ctmax'
- ctmax2
second best resampled regression tree according to R2 on the full data set
- percmax2
second best R2 achieved for %observations
- perfeamax2
second best R2 achieved for %predictors
- max2R2
second best R2 on the full data set for resampled regression trees (for 'ctmax2')
- interp2max
interpretability of second-best tree 'ctmax2'
Examples
data <- PrInDT::data_vowel
data <- na.omit(data)
ctestv <- 'vowel_maximum_pitch <= 320'
N <- 30 # no. of repetitions
pobs <- c(0.70,0.60) # percentages of observations
ppre <- c(0.90,0.70) # percentages of predictors
outreg <- PrInDTreg(data,"target",ctestv,N,pobs,ppre)
outreg
plot(outreg)