NesPrInDT {PrInDT}    R Documentation

Nested PrInDT with additional undersampling of a factor with two unbalanced levels

Description

Function for additional undersampling of the factor 'nesvar' with two unbalanced levels, to avoid dominance of the more frequent level. The factor 'nesvar' need not be part of the input data frame 'datain'. Its data are given in the vector 'nesunder'. The observations in 'nesunder' must represent the same cases as in 'datain', in the same order.
PrInDT is called 'repin' times, each time on a subsample of the original data in which the more frequent level of 'nesunder' has approximately as many observations as the less frequent level.
Only the arguments 'nesvar', 'nesunder', and 'repin' relate to the additional undersampling; all other arguments relate to the standard PrInDT procedure.
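The balancing of the two levels of 'nesunder' can be pictured with a minimal base-R sketch. It is not the package's implementation: the helper name balance_two_levels and the toy data are assumptions made for illustration, and the sketch draws exactly as many cases from the more frequent level as the less frequent level has.

```r
# Hypothetical helper (not part of PrInDT): returns row indices of a subsample
# in which both levels of a two-level factor occur equally often.
balance_two_levels <- function(nesunder) {
  nesunder <- droplevels(as.factor(nesunder))
  stopifnot(nlevels(nesunder) == 2)
  counts <- table(nesunder)
  small_level <- names(counts)[which.min(counts)]
  idx_small <- which(nesunder == small_level)
  idx_large <- which(nesunder != small_level)
  # keep all cases of the rarer level, draw the same number from the other level
  keep_large <- idx_large[sample.int(length(idx_large), length(idx_small))]
  sort(c(idx_small, keep_large))
}

set.seed(1)
toy_nesunder <- factor(c(rep("adult", 90), rep("child", 10)))  # toy data
idx <- balance_two_levels(toy_nesunder)
balanced_counts <- table(toy_nesunder[idx])
```

Repeating such a draw 'repin' times and applying the same indices to the rows of 'datain' would give one balanced subsample per repetition.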
As in PrInDT, the aim is to optimally model the relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The trees generated by PrInDT can be restricted by excluding unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
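To illustrate what 'N' repetitions of undersampling with the percentages 'plarge' and 'psmall' amount to, here is a minimal base-R sketch. The helper undersample_once, the rounding rule, and the toy data are assumptions for illustration only; the package may select and round differently.

```r
# Hypothetical helper (not part of PrInDT): one undersample of a two-class factor.
undersample_once <- function(class, plarge, psmall = 1) {
  class <- droplevels(as.factor(class))
  stopifnot(nlevels(class) == 2, plarge > 0, plarge <= 1, psmall > 0, psmall <= 1)
  counts <- table(class)
  small_level <- names(counts)[which.min(counts)]
  idx_small <- which(class == small_level)
  idx_large <- which(class != small_level)
  # Assumption: sample sizes are the rounded percentages, at least 1 per class
  n_small <- max(1L, round(psmall * length(idx_small)))
  n_large <- max(1L, round(plarge * length(idx_large)))
  c(idx_small[sample.int(length(idx_small), n_small)],
    idx_large[sample.int(length(idx_large), n_large)])
}

set.seed(2)
toy_class <- factor(c(rep("zero", 950), rep("one", 50)))  # toy data
N <- 5
undersamples <- lapply(seq_len(N), function(i) undersample_once(toy_class, plarge = 0.06))
# class counts per undersample (rows: class levels, columns: repetitions)
sizes <- sapply(undersamples, function(idx) table(toy_class[idx]))
```

With these toy numbers each undersample keeps all 50 cases of the smaller class and 57 of the 950 cases of the larger class, so the classes are roughly balanced in every repetition.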
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
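The effect of 'thres' can be shown with a small base-R sketch. The helper predict_with_threshold and the probabilities are made up, and whether the package uses a strict or non-strict comparison at the threshold is an assumption here.

```r
# Hypothetical helper (not part of PrInDT): turn predicted probabilities of the
# smaller class into class labels for a given threshold.
predict_with_threshold <- function(prob_small, small_label, large_label, thres = 0.5) {
  # Assumption: probabilities strictly above 'thres' are assigned to the smaller class
  factor(ifelse(prob_small > thres, small_label, large_label),
         levels = c(large_label, small_label))
}

prob_small <- c(0.10, 0.35, 0.45, 0.60, 0.90)  # made-up probabilities
pred_default <- predict_with_threshold(prob_small, "one", "zero")
pred_lowered <- predict_with_threshold(prob_small, "one", "zero", thres = 0.3)
```

Lowering the threshold makes the smaller class be predicted more often: here two of the five cases are assigned to it at 0.5, and four at 0.3.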
Undersampling may be stratified in two ways by the feature 'strat'.
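As a rough picture of stratification according to the percentages of the values of 'strat' ('stratvers' = 1), the following base-R sketch samples the same fraction within every level of a stratification factor. The helper stratified_sample and the toy data are assumptions; how the package combines this with the class-wise percentages, and how 'stratvers' > 1 enforces a minimum number of observations, is not shown.

```r
# Hypothetical helper (not part of PrInDT): draw a fraction 'p' of the cases
# separately within each level of a stratification factor.
stratified_sample <- function(strat, p) {
  strat <- droplevels(as.factor(strat))
  idx_by_level <- split(seq_along(strat), strat)
  picked <- lapply(idx_by_level, function(idx) {
    n_take <- max(1L, round(p * length(idx)))  # at least one case per level
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(3)
toy_strat <- factor(c(rep("a", 600), rep("b", 300), rep("c", 100)))  # toy data
idx <- stratified_sample(toy_strat, p = 0.1)
strat_counts <- table(toy_strat[idx])
```

The subsample keeps the 6:3:1 proportions of the three levels, which is the point of stratifying by percentages.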
The results are evaluated on the full sample and on the subsamples of 'nesunder'.

Reference
Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT. arXiv:2103.14931

Usage

NesPrInDT(datain, classname, ctestv=NA, N, plarge, psmall=1.0, conf.level=0.95,
       thres=0.5, stratvers=0, strat=NA, seedl=TRUE, nesvar, nesunder, repin)

Arguments

datain

Input data frame with the class factor variable 'classname' and the
influential variables, which must be factors or numerical variables (convert logical and character variables to factors)

classname

Name of class variable (character)

ctestv

Vector of character strings of forbidden split results;
see function PrInDT for details.
If no restrictions exist, the default = NA is used.

N

Number of repetitions (integer > 0)

plarge

Undersampling percentage of larger class (numerical, > 0 and <= 1)

psmall

Undersampling percentage of smaller class (numerical, > 0 and <= 1);
default = 1

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

thres

Probability threshold for prediction of smaller class; default = 0.5

stratvers

Version of stratification;
= 0: none (default),
= 1: stratification according to the percentages of the values of the factor variable 'strat',
> 1: stratification with minimum number 'stratvers' of observations per value of 'strat'

strat

Name of one (!) stratification variable for undersampling (character);
default = NA (no stratification)

seedl

Should the seed for random numbers be set (TRUE / FALSE)?
default = TRUE

nesvar

Name of factor to be undersampled (character)

nesunder

Data of factor to be undersampled (integer)

repin

Number of repetitions (integer) for undersampling of 'nesvar'
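The requirement on 'datain' (only factors and numerical variables) can be met with a few lines of base R before calling the function. The data frame below is made up for illustration; only the conversion pattern matters.

```r
# Toy data frame (made up for illustration) with a logical and a character column.
toy <- data.frame(
  class  = c("zero", "one", "zero", "one"),
  female = c(TRUE, FALSE, TRUE, TRUE),
  age    = c(23, 41, 35, 58),
  stringsAsFactors = FALSE
)

# Convert every logical or character column to a factor; leave the rest untouched.
to_convert <- vapply(toy, function(col) is.logical(col) || is.character(col), logical(1))
toy[to_convert] <- lapply(toy[to_convert], as.factor)

# Check the resulting column classes
column_classes <- vapply(toy, function(col) class(col)[1], character(1))
```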

Details

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the object returned by the function.
The plot function produces a series of several plots. If you use the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to keep the whole series of plots. In RStudio this functionality is provided automatically.

Value

undba

balanced accuracies on undersamples

imax

indices of best trees on undersamples

undba3en

balanced accuracies of ensembles of 3 best trees on undersamples

accF

balanced accuracies on full sample

accE

balanced accuracy on full sample of best ensemble of 3 trees from undersampling

maxt

indices of best trees on full sample

treesb

3 best trees of all undersamples of 'nesunder'; refer to an individual tree as treesb[[k]], k = 1, ..., 3*repin
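Several of the returned values are balanced accuracies. As a reminder of the usual definition (the mean of the per-class accuracies), here is a minimal base-R sketch; the helper balanced_accuracy and the toy labels are assumptions for illustration and are not taken from the package code.

```r
# Hypothetical helper (not part of PrInDT): balanced accuracy as the mean of the
# per-class accuracies (recalls).
balanced_accuracy <- function(observed, predicted) {
  observed <- as.factor(observed)
  predicted <- factor(predicted, levels = levels(observed))
  per_class <- vapply(levels(observed), function(lv) {
    mean(predicted[observed == lv] == lv)
  }, numeric(1))
  mean(per_class)
}

observed  <- factor(c(rep("zero", 8), rep("one", 2)))            # toy labels
predicted <- factor(c(rep("zero", 6), rep("one", 2), "one", "zero"),
                    levels = levels(observed))
ba <- balanced_accuracy(observed, predicted)
plain_accuracy <- mean(observed == predicted)
```

In this toy case the plain accuracy is 0.7, while the balanced accuracy is (0.75 + 0.5) / 2 = 0.625, because the smaller class counts as much as the larger one.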

Examples

# data input and preparation --> data frame with 
#   class variable, factors, and numericals (no character variables)!!
data <- PrInDT::data_speaker
data <- na.omit(data)
nesvar <- "SPEAKER"
N <- 49  # no. of repetitions in inner loop
plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample
psmall <- 1 # sampling percentage for smaller class in nesunder-subsample
nesunder <- data$SPEAKER # values of the factor to be undersampled
data[,nesvar] <- list(NULL) # remove 'nesvar' from the input data frame
outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar,
  nesunder=nesunder,repin=5)
outNes
plot(outNes)
hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")


[Package PrInDT version 1.0.1 Index]