RePrInDT {PrInDT} | R Documentation
Repeated PrInDT for specified percentage combinations
Description
PrInDT is called repeatedly according to the percentages specified in the vectors 'plarge' and 'psmall'.
The relationship between the two-class factor variable 'classname' and all other factor and numerical variables
in the data frame 'datain' is optimally modeled by means of 'N' repetitions of undersampling.
The trees generated from undersampling can be restricted by rejecting
unacceptable trees which include split results specified in the character strings of the vector 'ctestv'.
The probability threshold 'thres' for the prediction of the smaller class may be specified (default = 0.5).
Undersampling may be stratified in two ways by the feature 'strat'.
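The undersampling step described above can be illustrated with a minimal base-R sketch. It assumes sampling without replacement and rounding of the subsample sizes; neither detail is stated on this page, and the helper 'undersample_once' is hypothetical and not part of PrInDT.

```r
# Toy data: 90 observations in the larger class, 10 in the smaller class.
set.seed(1)
toy <- data.frame(
  y = factor(rep(c("large", "small"), times = c(90, 10))),
  x = rnorm(100)
)

# Hypothetical helper (NOT part of PrInDT): draw one undersample, keeping
# a fraction 'pl' of the larger class and 'ps' of the smaller class.
# Assumptions: sampling without replacement, sizes rounded to integers.
undersample_once <- function(d, classname, pl, ps) {
  counts <- table(d[[classname]])
  large <- names(counts)[which.max(counts)]
  idx_large <- which(d[[classname]] == large)
  idx_small <- which(d[[classname]] != large)
  keep <- c(
    idx_large[sample.int(length(idx_large), round(pl * length(idx_large)))],
    idx_small[sample.int(length(idx_small), round(ps * length(idx_small)))]
  )
  d[keep, , drop = FALSE]
}

# One repetition for the combination plarge = 0.1, psmall = 1.
sub <- undersample_once(toy, "y", pl = 0.1, ps = 1)
class_counts <- table(sub$y)
```

With 'N' repetitions, a call of this kind would be made 'N' times for each combination of 'plarge' and 'psmall', and a tree would be fitted to each subsample.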
Reference
Weihs, C., Buschfeld, S. 2021c. Repeated undersampling in PrInDT (RePrInDT): Variation in undersampling and prediction,
and ranking of predictors in ensembles. arXiv:2108.05129
Usage
RePrInDT(datain, classname, ctestv=NA, N, plarge, psmall, conf.level=0.95,
thres=0.5, stratvers=0, strat=NA, seedl=TRUE)
Arguments
datain | Input data frame with class factor variable 'classname' and the other factor and numerical variables used as predictors
classname | Name of class variable (character)
ctestv | Vector of character strings of forbidden split results; default = NA
N | Number of repetitions (integer > 0)
plarge | Vector of undersampling percentages of larger class (numerical, > 0 and <= 1)
psmall | Vector of undersampling percentages of smaller class (numerical, > 0 and <= 1)
conf.level | (1 - significance level) in function ctree; default = 0.95
thres | Probability threshold for prediction of smaller class (numerical, >= 0 and < 1); default = 0.5
stratvers | Version of stratification; default = 0
strat | Name of one (!) stratification variable for undersampling (character); default = NA
seedl | Should the seed for random numbers be set (TRUE / FALSE)? default = TRUE
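How a probability threshold such as 'thres' turns predicted probabilities into class labels can be sketched as follows. The probabilities are made up, the helper 'predict_with_threshold' is hypothetical, and the use of a strict '>' at the boundary is an assumption, since this page does not say how ties are handled.

```r
# Made-up predicted probabilities of the smaller class for five observations.
p_small <- c(0.10, 0.35, 0.50, 0.62, 0.90)

# Hypothetical helper (NOT part of PrInDT): predict the smaller class
# whenever its probability exceeds 'thres'.
# Assumption: strict '>' comparison at the boundary.
predict_with_threshold <- function(p, thres = 0.5,
                                   labels = c("large", "small")) {
  factor(ifelse(p > thres, labels[2], labels[1]), levels = labels)
}

pred_default <- predict_with_threshold(p_small)              # thres = 0.5
pred_low     <- predict_with_threshold(p_small, thres = 0.3) # lower threshold
```

Lowering the threshold assigns more observations to the smaller class, which is the usual reason to move it away from 0.5 for unbalanced data.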
Details
Standard output can be produced by means of print(name) or just name, as well as plot(name), where 'name' is the object returned by the function.
The plot function produces a series of several plots. If you use the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) to save the whole series of plots. In RStudio this functionality is provided automatically.
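On platforms where windows() is not available, one portable way to keep a whole series of plots is to send them to a multi-page PDF. The sketch below uses two stand-in plots so that it runs without the package; in practice the stand-in calls would be replaced by plot(name). The file location is an arbitrary choice for illustration.

```r
# Open a PDF device; with the default settings each new plot starts a new page.
out_file <- tempfile(fileext = ".pdf")
pdf(out_file)

# Stand-ins for the series of plots; replace these by plot(name).
plot(1:10, main = "stand-in for the first plot")
plot(10:1, main = "stand-in for the second plot")

# Close the device so that the file is written completely.
invisible(dev.off())
```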
Value
- treesb
best trees for the different percentage combinations; refer to an individual tree as treesb[[k]], k = 1, ..., length(plarge)*length(psmall)
- acc1st
accuracies of best trees on full sample
- acc3en
accuracies of ensemble of 3 best trees on full sample
- simp_m
mean of permutation losses for the predictors
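The range of the index k for 'treesb' follows directly from the lengths of the two percentage vectors. The sketch below computes it for the values used in the Examples section and lists the combinations; the order in which the package assigns k to the individual pairs is not stated on this page, so the enumeration shown is only an assumption.

```r
plarge <- c(0.09, 0.1)   # percentages of the larger class
psmall <- c(0.95, 1)     # percentages of the smaller class

# Number of percentage combinations, i.e. the number of best trees in 'treesb'.
n_comb <- length(plarge) * length(psmall)

# One possible enumeration of the combinations.
# Assumption: this ordering need not match the package's internal order of k.
combos <- expand.grid(plarge = plarge, psmall = psmall)
```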
Examples
datastrat <- PrInDT::data_zero
data <- na.omit(datastrat) # cleaned full data: no NAs
# interpretation restrictions (split exclusions)
ctestv <- rbind('ETH == {C2a, C1a}', 'MLU == {1, 3}')
N <- 51 # no. of repetitions
conf.level <- 0.99 # 1 - significance level (mincriterion) in ctree
psmall <- c(0.95,1) # percentages of the small class
plarge <- c(0.09,0.1) # percentages of the large class
outRe <- RePrInDT(data,"real",ctestv,N,plarge,psmall,conf.level) # might take 5 minutes
outRe
plot(outRe)