NesPrInDT {PrInDT} | R Documentation |
Nested PrInDT with additional undersampling of a factor with two unbalanced levels
Description
Function for additional undersampling of the factor 'nesvar' with two unbalanced levels to avoid dominance of the level with higher frequency.
The factor 'nesvar' need not be part of the input data frame 'datain'; the data of this factor is supplied in the vector 'nesunder'.
The observations in 'nesunder' have to represent the same cases as in 'datain' in the same ordering.
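The alignment requirement above can be checked before calling the function. The following is a minimal sketch in base R on made-up data; the names 'toy' and 'speaker' are hypothetical and not part of the package.

```r
# Hypothetical toy data; 'toy' and 'speaker' are illustrative names only.
toy <- data.frame(
  class = factor(c("a", "b", "a", "a", "b", "a")),
  x     = c(1.2, 0.4, 2.1, 1.7, 0.3, 1.9)
)
speaker <- factor(c("child", "adult", "adult", "adult", "child", "adult"))

# One entry per row of the data frame, in the same row order.
stopifnot(length(speaker) == nrow(toy))
# Exactly two levels, as required for the factor to be undersampled.
stopifnot(nlevels(speaker) == 2)
# The frequencies show which level is the larger one.
table(speaker)
```

Row order cannot be verified from the vector alone, so extracting the vector from the data frame before removing the column is the safest way to keep both aligned.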
PrInDT is called 'repin' times with subsamples of the original data so that the level with the larger frequency in the vector 'nesunder' has approximately the same number of values as the level with the smaller frequency.
Only the arguments 'nesvar', 'nesunder', and 'repin' relate to the additional undersampling; all other arguments relate to the standard PrInDT procedure.
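The idea of the additional (outer) undersampling can be illustrated in base R. This is only a sketch under stated assumptions, not the package's internal code: the data are made up, and the larger level is drawn down to exactly the size of the smaller level, whereas the function only promises approximately equal sizes.

```r
set.seed(1)  # only for reproducibility of this sketch
nesunder <- factor(c(rep("adult", 90), rep("child", 10)))  # hypothetical data
repin <- 5

counts  <- table(nesunder)
small   <- names(counts)[which.min(counts)]
large   <- names(counts)[which.max(counts)]
n_small <- min(counts)

idx_small <- which(nesunder == small)
idx_large <- which(nesunder == large)

# One index set per outer repetition: all cases of the smaller level plus
# an equally sized random draw (assumption) from the larger level.
subsamples <- lapply(seq_len(repin), function(r) {
  sort(c(idx_small, sample(idx_large, n_small)))
})

# Each subsample is balanced with respect to the nested factor.
table(nesunder[subsamples[[1]]])
```

Each index set would then select the rows of 'datain' handed to one run of the standard procedure.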
As in PrInDT, the aim is to optimally model the relationship between the two-class factor variable 'classname' and all other factor and numerical variables in the data frame 'datain' by means of 'N' repetitions of undersampling. The trees generated by PrInDT can be restricted by excluding unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
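The effect of the threshold can be sketched in base R as follows. The probabilities are made up, 'classify' is a hypothetical helper, and the strict comparison (greater than 'thres') is an assumption; the package may handle ties differently.

```r
# Hypothetical predicted probabilities of the smaller class for six cases.
p_small <- c(0.10, 0.35, 0.50, 0.62, 0.28, 0.91)

classify <- function(p, thres = 0.5) {
  # Assumption: a case is assigned to the smaller class when p > thres.
  ifelse(p > thres, "small", "large")
}

classify(p_small)               # default threshold
classify(p_small, thres = 0.3)  # lower threshold favours the smaller class
```

Lowering 'thres' assigns more cases to the smaller class, which typically raises its sensitivity at the cost of more errors in the larger class.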
Undersampling may be stratified in two ways by the feature 'strat'.
The results are evaluated on the full sample and on the subsamples of 'nesunder'.
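The evaluation is reported as balanced accuracy, i.e. the mean of the per-class accuracies. A minimal base-R sketch of this measure follows; 'balanced_accuracy' is a hypothetical helper written for illustration, not a function of the package.

```r
balanced_accuracy <- function(truth, pred) {
  truth <- factor(truth)
  pred  <- factor(pred, levels = levels(truth))
  tab   <- table(truth, pred)
  # Per-class accuracy: correctly predicted cases / cases in that class.
  mean(diag(tab) / rowSums(tab))
}

truth <- c("large", "large", "large", "large", "small", "small")
pred  <- c("large", "large", "large", "small", "small", "large")
balanced_accuracy(truth, pred)  # (3/4 + 1/2) / 2 = 0.625
```

Unlike plain accuracy, this measure gives both classes equal weight, so a tree cannot score well merely by predicting the larger class.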
Reference
Weihs, C., Buschfeld, S. 2021b. NesPrInDT: Nested undersampling in PrInDT.
arXiv:2103.14931
Usage
NesPrInDT(datain, classname, ctestv=NA, N, plarge, psmall=1.0, conf.level=0.95,
thres=0.5, stratvers=0, strat=NA, seedl=TRUE, nesvar, nesunder, repin)
Arguments
datain |
Input data frame with class factor variable 'classname' and the |
classname |
Name of class variable (character) |
ctestv |
Vector of character strings of forbidden split results; |
N |
Number of repetitions (integer > 0) |
plarge |
Undersampling percentage of larger class (numerical, > 0 and <= 1) |
psmall |
Undersampling percentage of smaller class (numerical, > 0 and <= 1); |
conf.level |
(1 - significance level) in function |
thres |
Probability threshold for prediction of smaller class; default = 0.5 |
stratvers |
Version of stratification; |
strat |
Name of one (!) stratification variable for undersampling (character); |
seedl |
Should the seed for random numbers be set (TRUE / FALSE)? |
nesvar |
Name of factor to be undersampled (character) |
nesunder |
Data of factor to be undersampled (integer) |
repin |
Number of repetitions (integer) for undersampling of 'nesvar' |
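How 'plarge' and 'psmall' shape one (inner) undersample of the class variable can be sketched in base R. The data are made up, and rounding the sample sizes is an assumption of this sketch rather than documented behaviour of the package.

```r
set.seed(2)  # only for reproducibility of this sketch
class_var <- factor(c(rep("no", 200), rep("yes", 20)))  # hypothetical data
plarge <- 0.1
psmall <- 1.0

counts <- table(class_var)
large  <- names(counts)[which.max(counts)]
small  <- names(counts)[which.min(counts)]

idx_large <- which(class_var == large)
idx_small <- which(class_var == small)

# Assumption: sample sizes are the rounded percentages of each class size.
take_large <- sample(idx_large, round(plarge * length(idx_large)))
take_small <- sample(idx_small, round(psmall * length(idx_small)))

undersample <- sort(c(take_large, take_small))
table(class_var[undersample])
```

Choosing 'plarge' close to the ratio of the class sizes (here 20/200 = 0.1) yields roughly balanced undersamples when 'psmall' is 1.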
Details
Standard output can be produced by means of print(name) or just name as well as plot(name), where 'name' is the output data frame of the function.
The plot function will produce a series of several plots. If you use the R GUI on Windows, you might want to specify windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio this functionality is provided automatically.
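Since windows() is only available on Windows, a portable alternative is to send the series to a multi-page PDF device. The sketch below uses stand-in plots and a hypothetical file name; it assumes that plot(name) draws on the active graphics device, as plot methods normally do.

```r
out_file <- file.path(tempdir(), "plot_series.pdf")  # hypothetical file name
pdf(out_file, onefile = TRUE)  # open a multi-page PDF device
for (i in 1:3) {
  # Stand-in plots; with a fitted object one would call plot(name) instead.
  plot(seq_len(10), seq_len(10)^i, main = paste("Plot", i))
}
dev.off()                      # close the device so the file is written
file.exists(out_file)
```

Each call that starts a new plot opens a new page in the file, so the whole series is kept.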
Value
- undba
balanced accuracies on undersamples
- imax
indices of best trees on undersamples
- undba3en
balanced accuracies of ensembles of 3 best trees on undersamples
- accF
balanced accuracies on full sample
- accE
balanced accuracy on full sample of best ensemble of 3 trees from undersampling
- maxt
indices of best trees on full sample
- treesb
3 best trees of all undersamples of 'nesunder'; refer to an individual tree as treesb[[k]], k = 1, ..., 3*repin
Examples
# data input and preparation --> data frame with
# class variable, factors, and numericals (no character variables)!!
data <- PrInDT::data_speaker
data <- na.omit(data)
nesvar <- "SPEAKER"
N <- 49 # no. of repetitions in inner loop
plarge <- 0.06 # sampling percentage for larger class in nesunder-subsample
psmall <- 1 # sampling percentage for smaller class in nesunder-subsample
nesunder <- data$SPEAKER
data[,nesvar] <- list(NULL)
outNes <- NesPrInDT(data,"class",ctestv=NA,N,plarge,psmall,conf.level=0.95,nesvar=nesvar,
nesunder=nesunder,repin=5)
outNes
plot(outNes)
hist(outNes$undba,main=" ",xlab = "balanced accuracies of 3 best trees of all undersamples")