NesPrInDT {PrInDT}

R Documentation

Nested `PrInDT` with additional undersampling of a factor with two unbalanced levels

Description

Function for additional undersampling of the factor 'nesvar' with two unbalanced levels to avoid dominance of the level with higher frequency. The factor 'nesvar' need not be part of the input data frame 'datain'. The data of this factor are given in the vector 'nesunder'. The observations in 'nesunder' have to represent the same cases as in 'datain', in the same order.
PrInDT is called 'repin' times with subsamples of the original data so that the level with the larger frequency in the vector 'nesunder' has approximately the same number of values as the level with the smaller frequency.
Only the arguments 'nesvar', 'nesunder', and 'repin' relate to the additional undersampling; all other arguments relate to the standard PrInDT procedure.
As in PrInDT, the aim is to optimally model the relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The trees generated by PrInDT can be restricted by excluding unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'.
The results are evaluated on the full sample and on the subsamples of 'nesunder'.
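
The outer undersampling step described above can be illustrated with a minimal base-R sketch. It is not the package's implementation: the toy vector `nesunder_toy`, the helper `balance_once`, and the seed are assumptions made for illustration only. Each draw keeps all cases of the less frequent level and a random subset of the same size from the more frequent level, whereas the description only requires the two counts to be approximately equal.

```r
# Illustrative sketch only: balance a two-level factor by undersampling the
# more frequent level. All names here are hypothetical, not part of PrInDT.
set.seed(1)
nesunder_toy <- factor(c(rep("adult", 90), rep("child", 10)))

balance_once <- function(f) {
  counts <- table(f)
  small <- names(counts)[which.min(counts)]
  large <- names(counts)[which.max(counts)]
  idx_small <- which(f == small)
  idx_large <- which(f == large)
  # keep all cases of the smaller level, draw equally many of the larger one
  sort(c(idx_small, sample(idx_large, length(idx_small))))
}

repin_toy <- 5  # number of outer repetitions in this toy example
subsamples <- lapply(seq_len(repin_toy), function(i) balance_once(nesunder_toy))
table(nesunder_toy[subsamples[[1]]])
```

Because 'nesunder' and 'datain' are required to list the same cases in the same order, row indices such as `subsamples[[1]]` could be used to subset both objects consistently.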

Reference

Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT. arXiv:2103.14931

Usage

NesPrInDT(datain, classname, ctestv=NA, N, plarge, psmall=1.0, conf.level=0.95,
       thres=0.5, stratvers=0, strat=NA, seedl=TRUE, nesvar, nesunder, repin)

Arguments

`datain`	Input data frame with class factor variable 'classname' and the influential variables, which need to be factors or numericals (transform logicals and character variables to factors)
`classname`	Name of class variable (character)
`ctestv`	Vector of character strings of forbidden split results; see function `PrInDT` for details. If no restrictions exist, the default = NA is used.
`N`	Number of repetitions (integer > 0)
`plarge`	Undersampling percentage of larger class (numerical, > 0 and <= 1)
`psmall`	Undersampling percentage of smaller class (numerical, > 0 and <= 1); default = 1
`conf.level`	(1 - significance level) in function `ctree` (numerical, > 0 and <= 1); default = 0.95
`thres`	Probability threshold for prediction of smaller class; default = 0.5
`stratvers`	Version of stratification; = 0: none (default), = 1: stratification according to the percentages of the values of the factor variable 'strat', > 1: stratification with minimum number 'stratvers' of observations per value of 'strat'
`strat`	Name of one (!) stratification variable for undersampling (character); default = NA (no stratification)
`seedl`	Should the seed for random numbers be set (TRUE / FALSE)? default = TRUE
`nesvar`	Name of factor to be undersampled (character)
`nesunder`	Data of factor to be undersampled (integer)
`repin`	Number of repetitions (integer) for undersampling of 'nesvar'
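
To make the roles of 'plarge', 'psmall', and 'thres' more concrete, the following base-R sketch shows one plausible reading of these arguments on toy data. It is an assumption-laden illustration, not PrInDT's internal code: the object names, the rounding of the subsample sizes, and the strict comparison with the threshold are all hypothetical choices.

```r
# Illustrative sketch only; object names are hypothetical, not PrInDT internals.
set.seed(2)
class_toy <- factor(c(rep("large", 200), rep("small", 20)))
plarge_toy <- 0.06   # fraction of the larger class to keep
psmall_toy <- 1      # fraction of the smaller class to keep

# Subsample sizes per class (the rounding rule is an assumption)
n_large <- round(plarge_toy * sum(class_toy == "large"))
n_small <- round(psmall_toy * sum(class_toy == "small"))
keep <- c(sample(which(class_toy == "large"), n_large),
          sample(which(class_toy == "small"), n_small))
table(class_toy[keep])

# Thresholding predicted probabilities of the smaller class
p_small <- c(0.10, 0.35, 0.50, 0.80)  # made-up probabilities
thres_toy <- 0.3
pred <- ifelse(p_small > thres_toy, "small", "large")  # strict ">" is an assumption
pred
```

Lowering the threshold below 0.5 makes predictions of the smaller class more frequent, which is the usual motivation for exposing such a parameter with unbalanced classes.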

Details

Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the output of the function.
The plot function produces a series of several plots. If you use the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to save the whole series of plots; the windows() device is available on Windows only. In RStudio this functionality is provided automatically.
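
Where `windows()` is not available, one portable alternative is to send the series of plots to a multi-page PDF with the base R `pdf()` device. The sketch below uses ordinary base plots as stand-ins for plot(name); the file name and the plotted values are arbitrary placeholders.

```r
# Stand-in for plot(name): each plot in the series goes to its own PDF page.
out_file <- file.path(tempdir(), "nesprindt_plots.pdf")  # placeholder file name
pdf(out_file)            # onefile = TRUE by default: one page per plot
plot(1:10, main = "first plot")
hist(c(0.61, 0.64, 0.66, 0.70, 0.72), main = "second plot", xlab = "value")
dev.off()                # close the device so the file is written
file.exists(out_file)
```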

Value

undba: balanced accuracies on undersamples
imax: indices of best trees on undersamples
undba3en: balanced accuracies of ensembles of 3 best trees on undersamples
accF: balanced accuracies on full sample
accE: balanced accuracy on full sample of best ensemble of 3 trees from undersampling
maxt: indices of best trees on full sample
treesb: 3 best trees of all undersamples of 'nesunder'; refer to an individual tree as treesb[[k]], k = 1, ..., 3*repin
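
Several of the returned values are balanced accuracies. As a reminder of the standard definition (the mean of the per-class accuracies), here is a small self-contained sketch; the helper `balanced_accuracy` and the toy vectors are hypothetical and are not taken from the PrInDT source code.

```r
# Balanced accuracy as the mean of the per-class accuracies (standard
# definition; shown for orientation, not copied from PrInDT).
balanced_accuracy <- function(truth, pred) {
  truth <- factor(truth)
  per_class <- vapply(levels(truth), function(lv) {
    mean(pred[truth == lv] == lv)  # share of class 'lv' predicted correctly
  }, numeric(1))
  mean(per_class)
}

truth_toy <- c(rep("large", 8), rep("small", 2))
pred_toy  <- c(rep("large", 6), rep("small", 2), "small", "large")
balanced_accuracy(truth_toy, pred_toy)  # (6/8 + 1/2) / 2 = 0.625
```

In this toy case the plain accuracy would be 0.7, while the balanced accuracy of 0.625 gives the two classes equal weight regardless of their sizes.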

Examples

# data input and preparation --> data frame with 
#   class variable, factors, and numericals (no character variables)!!
data <- PrInDT::data_speaker
data <- na.omit(data)
nesvar <- "SPEAKER"
N <- 49  # no. of repetitions in inner loop
plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample
psmall <- 1 # sampling percentage for smaller class in nesunder-subsample
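nesunder <- data$SPEAKER  # keep the factor to be undersampled as a separate vector
data[,nesvar] <- list(NULL)  # drop the column 'nesvar' from the input data frame
# 5 outer repetitions of undersampling 'nesvar', each with N inner PrInDT repetitions
outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar,
  nesunder=nesunder,repin=5)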
outNes
plot(outNes)
hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")

[Package PrInDT version 1.0.1 Index]