PrInDTMulab {PrInDT}    R Documentation

Multiple label classification based on resampling by PrInDT

Description

Multiple label classification based on resampling by PrInDT. Two ways of modeling are considered (binary relevance modeling and dependent binary relevance modeling), together with three ways of model evaluation: single assessment, joint assessment, and true prediction (see the Value section for more information).
Variables should be arranged in 'datain' according to indices specified in 'indind', 'indaddind', and 'inddep'.
Undersampling is repeated 'N' times.
Undersampling percentages 'percl' for the larger class and 'percs' for the smaller class can be specified, one each per dependent class variable.
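
The resampling idea can be illustrated with a minimal base R sketch. This is not the package's internal code: the helper undersample() and the toy data are hypothetical, and drawing without replacement is an assumption made for illustration.

```r
# Hypothetical helper (not part of PrInDT): draw one undersample in which a
# fraction 'percl' of the larger class and 'percs' of the smaller class is kept.
# Assumption: observations are drawn without replacement.
undersample <- function(data, classname, percl, percs = 1) {
  counts   <- table(data[[classname]])
  larger   <- names(counts)[which.max(counts)]
  idx_l    <- which(data[[classname]] == larger)   # rows of the larger class
  idx_s    <- which(data[[classname]] != larger)   # rows of the smaller class
  keep_l   <- idx_l[sample.int(length(idx_l), round(percl * length(idx_l)))]
  keep_s   <- idx_s[sample.int(length(idx_s), round(percs * length(idx_s)))]
  data[sort(c(keep_l, keep_s)), , drop = FALSE]
}

# Toy data: 80 observations of class "a", 20 of class "b".
set.seed(42)
toy <- data.frame(x = rnorm(100),
                  y = factor(rep(c("a", "b"), times = c(80, 20))))

# Repeat the undersampling N times; a model would be fitted on each sample.
N <- 5
samples <- lapply(seq_len(N), function(i) undersample(toy, "y", percl = 0.25, percs = 1))
table(samples[[1]]$y)   # 20 observations of "a" and 20 of "b"
```

With percl = 0.25 and percs = 1, each sample is balanced here because a quarter of the larger class equals the size of the smaller class.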

Reference
Probst, P., Au, Q., Casalicchio, G., Stachl, C., and Bischl, B. 2017. Multilabel Classification with R Package mlr. arXiv:1703.08991v2

Usage

PrInDTMulab(datain, classnames, ctestv, conf.level=0.95, percl, percs=1,
       N, indind, indaddind, inddep)

Arguments

datain

Input data frame with the class factor variables named in 'classnames' and the
influential variables, which need to be factors or numerical variables (transform logical and character variables to factors)

classnames

names of class variables (character vector)

ctestv

Vector of character strings of forbidden split results;
see function PrInDT for details.
If no restrictions exist, use ctestv = NA.

conf.level

(1 - significance level) in function ctree (numerical, > 0 and <= 1);
default = 0.95

percl

list of undersampling percentages of the larger class (numerical, > 0 and <= 1); one per dependent class variable

percs

list of undersampling percentages of the smaller class (numerical, > 0 and <= 1); one per dependent class variable;
default = 1

N

no. of repetitions (integer > 0)

indind

indices of independent variables

indaddind

indices of additional independent variables used in the case of dependent binary relevance modeling

inddep

indices of dependent variables

Details

Standard output can be produced by means of print(name), or just name, as well as plot(name), where 'name' is the output of the function.
The plot function produces a series of plots. If you use the R GUI on Windows, you might want to call windows(record=TRUE) before plot(name) so that the whole series of plots is recorded. In RStudio, this functionality is provided automatically.

Value

accbr

model errors for Binary Relevance (single assessment) - only independent predictors are used for modeling one label at a time, the other labels are not used as predictors. As the performance measure for the resulting classification rules, the balanced accuracy of the best model from PrInDT is employed for each individual label.

errbin

combined error for Binary Relevance (joint assessment) - the best prediction models for the different labels are combined to assess the combined prediction. The 0-1 accuracy counts a label combination as correct only if all labels are correctly predicted. The Hamming accuracy is the proportion of labels whose value is correctly predicted.

accdbr

model errors for Dependent Binary Relevance (Extended Model) (single assessment) - each label is trained by means of an extended model which not only includes the independent predictors but also the other labels. For these labels, the truly observed values are used for estimation and prediction. In the extended model, other labels, which are not treated as dependent variables, can also be used as additional predictors.

errext

combined errors for Dependent Binary Relevance (Extended Model) (joint assessment)

errtrue

combined errors for Dependent Binary Relevance (True Prediction) - in the prediction phase, the values of all modeled labels are first predicted from the independent predictors only; in a second step, these predicted labels are plugged into the estimated extended model to obtain the final label predictions.

coldata

column names of input data

inddep

indices of dependent variables (labels to be modeled)

treebr

list of trees from Binary Relevance modeling, one tree for each label; refer to an individual tree as treebr[[i]], i = 1, ..., no. of labels

treedbr

list of trees from Dependent Binary Relevance modeling, one for each label; refer to an individual tree as treedbr[[i]], i = 1, ..., no. of labels
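
The two-step true prediction and the accuracy measures described above can be sketched in base R as follows. This is an illustration only, not the package's implementation: logistic regression (glm) is swapped in for the conditional inference trees used by PrInDT, no undersampling is performed, and all object names and the toy data are hypothetical.

```r
# Toy data: two predictors and two binary labels (all names are hypothetical).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- factor(ifelse(x1 + rnorm(n) > 0, "yes", "no"), levels = c("no", "yes"))
y2 <- factor(ifelse(x2 + (y1 == "yes") + rnorm(n) > 0.5, "yes", "no"),
             levels = c("no", "yes"))
d  <- data.frame(x1, x2, y1, y2)

# Helper: turn predicted probabilities of "yes" into a label factor.
to_label <- function(p) factor(ifelse(p > 0.5, "yes", "no"), levels = c("no", "yes"))

# Binary relevance models: independent predictors only.
br1 <- glm(y1 ~ x1 + x2, data = d, family = binomial)
br2 <- glm(y2 ~ x1 + x2, data = d, family = binomial)

# Extended models: the other label is an additional predictor,
# estimated with its truly observed values.
ext1 <- glm(y1 ~ x1 + x2 + y2, data = d, family = binomial)
ext2 <- glm(y2 ~ x1 + x2 + y1, data = d, family = binomial)

# True prediction, step 1: predict every label from the independent predictors.
step1    <- d
step1$y1 <- to_label(predict(br1, newdata = d, type = "response"))
step1$y2 <- to_label(predict(br2, newdata = d, type = "response"))

# True prediction, step 2: plug the predicted labels into the extended models.
pred <- data.frame(
  y1 = to_label(predict(ext1, newdata = step1, type = "response")),
  y2 = to_label(predict(ext2, newdata = step1, type = "response"))
)

# Joint assessment: compare predicted and observed label combinations.
hits    <- as.matrix(pred) == as.matrix(d[, c("y1", "y2")])  # n x 2 logical matrix
acc01   <- mean(rowSums(hits) == ncol(hits))  # 0-1 accuracy: all labels correct
hamming <- mean(hits)                         # Hamming accuracy: share of correct labels

# Single assessment: balanced accuracy = mean of sensitivity and specificity.
balanced_accuracy <- function(truth, pred) {
  mean(c(mean(pred[truth == "yes"] == "yes"),
         mean(pred[truth == "no"]  == "no")))
}
bacc <- c(y1 = balanced_accuracy(d$y1, pred$y1),
          y2 = balanced_accuracy(d$y2, pred$y2))
```

By construction, the 0-1 accuracy can never exceed the Hamming accuracy, since an observation counted as correct by the former contributes only correct labels to the latter.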

Examples

data <- PrInDT::data_land # load data
dataclean <- data[,c(1:7,23:24,11:13,22,8:10)]  # only relevant features
indind <- c(1:9) # original predictors
indaddind <- c(10:13) # additional predictors
inddep <- c(14:16) # dependent variables
dataclean <- na.omit(dataclean)
ctestv <- NA
N <- 21  # no. of repetitions
perc <- c(0.45,0.05,0.25)   # percentages of observations of larger class, 
#                             1 per dependent class variable
perc2 <- c(0.75,0.95,0.75)  # percentages of observations of smaller class, 
#                             1 per dependent class variable
##
# Call PrInDT: language by language
##
outmult <- PrInDTMulab(dataclean,colnames(dataclean)[inddep],ctestv=ctestv,conf.level=0.95,
                  percl=perc,percs=perc2,N=N,indind=indind,indaddind=indaddind,inddep=inddep)
print(outmult)
plot(outmult)


[Package PrInDT version 1.0.1 Index]