R: Repeated 'PrInDT' for specified percentage combinations

RePrInDT {PrInDT}

R Documentation

Repeated `PrInDT` for specified percentage combinations

Description

`PrInDT` is called repeatedly, once for each combination of the percentages specified in the vectors 'plarge' and 'psmall'.
The relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' is optimally modeled by means of 'N' repetitions of undersampling.
Trees generated from undersampling are rejected as unacceptable if they contain any of the split results specified as character strings in the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways (see 'stratvers') by the variable 'strat'.

Reference

Weihs, C., Buschfeld, S. 2021c. Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction, and ranking of predictors in ensembles. arXiv:2108.05129
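
The undersampling and threshold ideas in the description can be illustrated with a small base R sketch. This is not the package's internal code: the helper names `undersample_once` and `predict_small` and the toy data are hypothetical, and the sketch assumes that a fraction 'plarge' of the larger class and a fraction 'psmall' of the smaller class are drawn without replacement, and that the smaller class is predicted when its estimated probability exceeds 'thres'.

```r
# Sketch only: one undersampling draw, NOT the package's internal code.
# Assumption: fractions 'plarge' and 'psmall' of the larger and smaller
# class are sampled without replacement.
undersample_once <- function(datain, classname, plarge, psmall) {
  counts <- table(datain[[classname]])
  large <- names(counts)[which.max(counts)]   # label of the larger class
  small <- names(counts)[which.min(counts)]   # label of the smaller class
  idx_large <- which(datain[[classname]] == large)
  idx_small <- which(datain[[classname]] == small)
  keep <- c(
    idx_large[sample.int(length(idx_large), round(plarge * length(idx_large)))],
    idx_small[sample.int(length(idx_small), round(psmall * length(idx_small)))]
  )
  datain[keep, , drop = FALSE]
}

# Hypothetical threshold rule (assumption): predict the smaller class
# when its estimated probability exceeds 'thres'.
predict_small <- function(prob_small, thres = 0.5) prob_small > thres

set.seed(1)
toy <- data.frame(
  y = factor(c(rep("large", 200), rep("small", 20))),
  x = rnorm(220)
)
sub <- undersample_once(toy, "y", plarge = 0.1, psmall = 1)
table(sub$y)                      # class counts after undersampling
predict_small(c(0.3, 0.6), 0.5)   # which observations are assigned to the smaller class
```

With 'plarge' = 0.1 and 'psmall' = 1 the two classes end up the same size in this toy data, which is the usual purpose of undersampling an unbalanced two-class problem.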

Usage

RePrInDT(datain, classname, ctestv=NA, N, plarge, psmall, conf.level=0.95,
       thres=0.5, stratvers=0, strat=NA, seedl=TRUE)

Arguments

`datain`	Input data frame with class factor variable 'classname' and the influential variables, which need to be factors or numericals (transform logicals and character variables to factors)
`classname`	Name of class variable (character)
`ctestv`	Vector of character strings of forbidden split results; see function `PrInDT` for details. If no restrictions exist, the default = NA is used.
`N`	Number of repetitions (integer > 0)
`plarge`	Vector of undersampling percentages of larger class (numerical, > 0 and <= 1)
`psmall`	Vector of undersampling percentages of smaller class (numerical, > 0 and <= 1)
`conf.level`	(1 - significance level) in function `ctree` (numerical, > 0 and <= 1); default = 0.95
`thres`	Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5
`stratvers`	Version of stratification; = 0: none (default), = 1: stratification according to the percentages of the values of the factor variable 'strat', > 1: stratification with minimum number 'stratvers' of observations per value of 'strat'
`strat`	Name of one (!) stratification variable for undersampling (character); default = NA (no stratification)
`seedl`	Should the seed for random numbers be set (TRUE / FALSE)? default = TRUE
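
Because 'datain' may only contain factors and numerical variables, logical and character columns have to be converted first. A minimal base R sketch of that conversion follows; the data frame `raw` and its column names are made up for illustration.

```r
# Sketch: convert logical and character columns to factors before
# passing a data frame as 'datain'. 'raw' is a made-up example.
raw <- data.frame(
  cls   = c("a", "b", "a", "b"),
  flag  = c(TRUE, FALSE, TRUE, TRUE),
  score = c(1.5, 2.0, 0.7, 3.1),
  stringsAsFactors = FALSE
)

# Flag the columns that are logical or character
to_convert <- vapply(raw,
                     function(col) is.logical(col) || is.character(col),
                     logical(1))

# Replace the flagged columns by factors; numeric columns stay untouched
raw[to_convert] <- lapply(raw[to_convert], factor)
str(raw)
```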

Details

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output data frame of the function.
The plot function produces a series of several plots. In the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to keep the whole series of plots. In RStudio, this functionality is provided automatically.
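
Since windows() is only available on Windows, one portable way to keep a whole series of plots is to send them to a multi-page PDF with the base `pdf()` device. The sketch below uses two placeholder plots in place of plot(name); the file name is arbitrary.

```r
# Sketch: write a series of plots to one multi-page PDF.
# The two plot() calls stand in for plot(name).
outfile <- file.path(tempdir(), "plots.pdf")
pdf(outfile)                              # open the device; each new plot starts a new page
plot(1:10, main = "placeholder plot 1")
plot(10:1, main = "placeholder plot 2")
dev.off()                                 # close the device so the file is written
file.exists(outfile)
```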

Value

treesb: best trees for the different percentage combinations; refer to an individual tree as treesb[[k]], k = 1, ..., length(plarge)*length(psmall)
acc1st: accuracies of best trees on full sample
acc3en: accuracies of ensemble of 3 best trees on full sample
simp_m: mean of permutation losses for the predictors
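
The number of percentage combinations, and hence the range of the index k, can be checked with a short base R sketch. The grid below is only one possible enumeration; the order in which the function itself stores the combinations is not stated here and is an assumption of the sketch.

```r
plarge <- c(0.09, 0.1)
psmall <- c(0.95, 1)

# One possible enumeration of the combinations (the ordering used by the
# package is an assumption here)
combos <- expand.grid(plarge = plarge, psmall = psmall)
n_comb <- length(plarge) * length(psmall)

combos
n_comb   # number of percentage combinations, i.e. the upper limit of k
```

For these vectors, k would run over seq_len(n_comb) when referring to an individual tree as treesb[[k]].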

Examples

library(PrInDT) # attach the package so that RePrInDT() is found
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}', 'MLU == {1, 3}')
N <- 51  # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
psmall <- c(0.95,1)     # percentages of the small class
plarge <- c(0.09,0.1)  # percentages of the large class
outRe <- RePrInDT(data,"real",ctestv,N,plarge,psmall,conf.level) # might take 5 minutes
outRe
plot(outRe)

[Package PrInDT version 1.0.1 Index]