RePrInDT {PrInDT}    R Documentation

Repeated PrInDT for specified percentage combinations

Description

PrInDT is called repeatedly according to the percentages specified in the vectors 'plarge' and 'psmall'.
The relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' is optimally modeled by means of 'N' repetitions of undersampling.
The trees generated from undersampling can be restricted by rejecting unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the variable 'strat'; the way is selected by 'stratvers'.

Reference
Weihs, C., Buschfeld, S. 2021c. Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction, and ranking of predictors in ensembles. arXiv:2108.05129

Usage

RePrInDT(datain, classname, ctestv=NA, N, plarge, psmall, conf.level=0.95,
       thres=0.5, stratvers=0, strat=NA, seedl=TRUE)

Arguments

datain

Input data frame with the class factor variable 'classname' and the
influential variables, which must be factors or numerical variables (convert logical and character variables to factors)
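
The conversion mentioned above can be sketched in base R as follows. This is only an illustration: the data frame 'df' and its column names are hypothetical and not part of the package.

```r
# Hypothetical toy data frame; column names are illustrative only
df <- data.frame(
  real = c("yes", "no", "no", "no"),
  flag = c(TRUE, FALSE, TRUE, TRUE),
  age  = c(3.1, 4.5, 2.2, 5.0),
  stringsAsFactors = FALSE
)

# Identify logical and character columns and convert them to factors
to_convert <- vapply(df, function(x) is.logical(x) || is.character(x), logical(1))
df[to_convert] <- lapply(df[to_convert], factor)

str(df)  # 'real' and 'flag' are now factors, 'age' stays numeric
```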

classname

Name of class variable (character)

ctestv

Vector of character strings of forbidden split results;
see function PrInDT for details.
If no restrictions exist, the default = NA is used.

N

Number of repetitions (integer > 0)

plarge

Vector of undersampling percentages of the larger class, given as proportions (numerical, > 0 and <= 1)

psmall

Vector of undersampling percentages of the smaller class, given as proportions (numerical, > 0 and <= 1)
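
To illustrate what one pair of percentages means, the following base R sketch draws a single undersample from a hypothetical class vector 'cls'. It is not the package's implementation; in particular, rounding the subsample sizes with round() is an assumption.

```r
set.seed(1)

# Hypothetical two-class factor: 200 observations in the larger class, 20 in the smaller
cls <- factor(c(rep("large", 200), rep("small", 20)))
plarge <- 0.1   # one entry of 'plarge'
psmall <- 0.95  # one entry of 'psmall'

# Determine which class is the smaller one
counts <- table(cls)
small_name <- names(counts)[which.min(counts)]
idx_small <- which(cls == small_name)
idx_large <- which(cls != small_name)

# Subsample sizes (assumption: rounded to the nearest integer)
n_large <- round(plarge * length(idx_large))
n_small <- round(psmall * length(idx_small))

# Draw without replacement within each class
sub <- c(idx_large[sample.int(length(idx_large), n_large)],
         idx_small[sample.int(length(idx_small), n_small)])

table(cls[sub])  # class counts in the undersample
```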

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

thres

Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5
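
The effect of the threshold can be sketched as follows. The helper 'predict_with_threshold' and the probabilities are hypothetical, and the strict comparison (> rather than >=) is an assumption, not necessarily the rule used inside the package.

```r
# Hypothetical predicted probabilities for the smaller class
p_small <- c(0.10, 0.35, 0.50, 0.80)

predict_with_threshold <- function(p, thres = 0.5,
                                   small = "small", large = "large") {
  # Assumption: the smaller class is predicted when its probability exceeds 'thres'
  factor(ifelse(p > thres, small, large), levels = c(large, small))
}

predict_with_threshold(p_small)               # default threshold 0.5
predict_with_threshold(p_small, thres = 0.3)  # a lower threshold favors the smaller class
```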

stratvers

Version of stratification;
= 0: none (default),
= 1: stratification according to the percentages of the values of the factor variable 'strat',
> 1: stratification with minimum number 'stratvers' of observations per value of 'strat'

strat

Name of one (!) stratification variable for undersampling (character);
default = NA (no stratification)
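
As a rough illustration of proportional stratification (the idea behind stratvers = 1), the sketch below draws the same fraction from every level of a hypothetical factor 'strat', so that the level percentages are preserved in the subsample. This is a sketch under stated assumptions, not the package's code; the variant with a minimum number of observations per level is not shown.

```r
set.seed(1)

# Hypothetical stratification factor with levels of unequal size
strat <- factor(rep(c("a", "b", "c"), times = c(100, 60, 40)))
p <- 0.1  # undersampling percentage applied within each level

# Sample the same fraction within every level (assumption: sizes are rounded)
idx <- unlist(lapply(split(seq_along(strat), strat), function(i) {
  i[sample.int(length(i), round(p * length(i)))]
}), use.names = FALSE)

table(strat[idx])  # level counts in the subsample mirror the original proportions
```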

seedl

Should the seed for random numbers be set (TRUE / FALSE)?
default = TRUE

Details

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the object returned by the function.
The plot function produces a series of several plots. In the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio, this functionality is provided automatically.

Value

treesb

best trees for the different percentage combinations; refer to an individual tree as treesb[[k]], k = 1, ..., length(plarge)*length(psmall)
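
The number of trees in 'treesb' follows from the number of percentage combinations, which can be counted as sketched below. The table 'combos' is purely illustrative: the order in which RePrInDT stores the trees is not stated here, so the row order should not be read as the index k.

```r
psmall <- c(0.95, 1)    # percentages of the small class
plarge <- c(0.09, 0.1)  # percentages of the large class

# All percentage combinations (row order is an assumption, not the package's ordering)
combos <- expand.grid(plarge = plarge, psmall = psmall)

nrow(combos)  # equals length(plarge) * length(psmall)
```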

acc1st

accuracies of best trees on full sample

acc3en

accuracies of ensemble of 3 best trees on full sample

simp_m

mean of permutation losses for the predictors

Examples

datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}', 'MLU == {1, 3}')
N <- 51  # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
psmall <- c(0.95,1)     # percentages of the small class
plarge <- c(0.09,0.1)  # percentages of the large class
outRe <- RePrInDT(data,"real",ctestv,N,plarge,psmall,conf.level) # might take 5 minutes
outRe
plot(outRe)


[Package PrInDT version 1.0.1 Index]