error_group {OTrecod}    R Documentation

error_group()

Description

This function studies the association between two categorical distributions with different numbers of modalities.

Usage

error_group(REF, Z, ord = TRUE)

Arguments

REF

a factor with a reference number of levels.

Z

a factor with a number of levels greater than the number of levels of the reference.

ord

a boolean. If TRUE, only neighboring levels of Z will be grouped and tested together.

Details

Assume that Y and Z are categorical variables summarizing the same information, and that one of the two encodings is unknown to the user because it is, for example, the result of predictions from a given model or algorithm. The function error_group then searches for potential links between the modalities of Y that best approximate the distribution of Z.

Assume that Y and Z have n_Y and n_Z modalities respectively, with n_Y > n_Z. In a first step, the function error_group combines modalities of Y to build all possible variables Y' verifying n_{Y'} = n_Z. In a second step, the association between Z and each new variable Y' is measured by the ratio of concordant pairs in the confusion matrix and by three standard criteria: Cramér's V (1), Cohen's kappa coefficient (2) and Spearman's rank correlation coefficient.

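According to the type of Y, different combinations of modalities are tested: when ord = TRUE, only neighboring modalities of Y are grouped; when ord = FALSE, all groupings of modalities, consecutive or not, are tested.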

All the associations tested are listed in output as a data.frame object. The function error_group is directly integrated in the function verif_OT to evaluate the proximity of two multinomial distributions, when one of them is estimated from the predictions of an OT algorithm.
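As a rough illustration of the two steps above, the following base R sketch enumerates the groupings of consecutive levels (the ord = TRUE case) and scores each one by its error rate. The helper consecutive_groupings() is hypothetical and not part of OTrecod; the sketch also assumes that merged group i is matched to level i of Z, which may differ from what error_group does internally.

```r
# Hypothetical helper (not an OTrecod function): all ways to merge k ordered
# levels into m groups of consecutive levels. Each result maps level -> group.
consecutive_groupings <- function(k, m) {
  if (m == 1) return(list(rep(1L, k)))
  # choose the m - 1 gaps (between neighboring levels) where a new group starts
  cuts <- combn(k - 1, m - 1, simplify = FALSE)
  lapply(cuts, function(cp) findInterval(seq_len(k), cp + 1) + 1L)
}

Y <- factor(c(1, 1, 2, 2, 3, 3, 4, 4))   # 4 levels
Z <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))   # 2 levels

# Step 1: build every candidate recoding Y' with as many levels as Z
maps <- consecutive_groupings(nlevels(Y), nlevels(Z))

# Step 2: score each Y' by the share of off-diagonal pairs
# (assumption: group i of Y' corresponds to level i of Z)
error_rates <- vapply(maps, function(map) {
  Y_prime <- map[as.integer(Y)]
  mean(Y_prime != as.integer(Z))
}, numeric(1))

data.frame(grouping   = vapply(maps, paste, character(1), collapse = " "),
           error_rate = error_rates)
# error_rate: 0.25, 0.00, 0.25
```

Only the second grouping, which merges levels {1,2} and {3,4}, reproduces Z exactly.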

Example: Assume that Y = (1,1,2,2,3,3,4,4) and Z = (1,1,1,1,2,2,2,2), so that n_Y = 4, n_Z = 2 and the correlation coefficient cor(Y,Z) is 0.89. Are there groupings of modalities of Y that improve the proximity between Y and Z? The function error_group answers this question by successively constructing, from Y, the variables Y_1 = (1,1,1,1,2,2,2,2), Y_2 = (1,1,2,2,1,1,2,2) and Y_3 = (1,1,2,2,2,2,1,1), and by testing cor(Z,Y_1) = 1, cor(Z,Y_2) = 0 and cor(Z,Y_3) = 0. Here, the tests show that the difference in encoding between Y and Z in fact amounts to a simple grouping of modalities.
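The correlations quoted in this example can be checked with base R's cor(). This is only a verification of the arithmetic, not a call to error_group; Pearson's coefficient is used here, and on these particular vectors Spearman's coefficient gives the same values.

```r
Y  <- c(1, 1, 2, 2, 3, 3, 4, 4)
Z  <- c(1, 1, 1, 1, 2, 2, 2, 2)

# Candidate recodings of Y with two modalities
Y1 <- c(1, 1, 1, 1, 2, 2, 2, 2)   # merges {1,2} and {3,4}
Y2 <- c(1, 1, 2, 2, 1, 1, 2, 2)   # merges {1,3} and {2,4}
Y3 <- c(1, 1, 2, 2, 2, 2, 1, 1)   # merges {1,4} and {2,3}

round(cor(Y, Z), 2)   # 0.89
cor(Z, Y1)            # 1
cor(Z, Y2)            # 0
cor(Z, Y3)            # 0
```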

Value

A data.frame with five columns:

combi

the first column enumerates all possible groups of modalities of Y to obtain the same number of levels as the reference.

error_rate

the second column gives the corresponding error rate from the confusion matrix (ratio of non-diagonal elements)

Kappa

this column gives Cohen's kappa coefficient for each combination of Y

Vcramer

this column gives Cramér's V for each combination of Y

RankCor

this column gives Spearman's rank correlation coefficient for each combination of Y
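For readers who want to see what these four criteria measure, here is a minimal base R sketch using their textbook definitions. The helper association_criteria() is hypothetical; it is not an OTrecod function, and error_group may compute these quantities differently (for instance regarding corrections or empty levels). It assumes both inputs are factors sharing the same ordered levels.

```r
# Hypothetical helper, not part of OTrecod: textbook versions of the criteria.
association_criteria <- function(ref, pred) {
  pred <- factor(pred, levels = levels(ref))   # align levels so the diagonal is meaningful
  tab  <- table(ref, pred)                     # confusion matrix
  n    <- sum(tab)

  p_obs <- sum(diag(tab)) / n                        # observed agreement
  p_exp <- sum(rowSums(tab) * colSums(tab)) / n^2    # agreement expected by chance

  # Pearson chi-square statistic without continuity correction
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2     <- sum((tab - expected)^2 / expected)

  c(error_rate = 1 - p_obs,
    Kappa      = (p_obs - p_exp) / (1 - p_exp),
    Vcramer    = sqrt(chi2 / (n * (min(dim(tab)) - 1))),
    RankCor    = cor(as.integer(ref), as.integer(pred), method = "spearman"))
}

ref  <- factor(c(1, 1, 1, 1, 2, 2, 2, 2))
pred <- factor(c(1, 1, 2, 2, 2, 2, 2, 2))
res  <- association_criteria(ref, pred)
res
```

On this toy input, two of the eight pairs fall off the diagonal, so the error rate is 0.25 and kappa is 0.5.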

Author(s)

Gregory Guernec

otrecod.pkg@gmail.com

References

  1. Cramér, Harald. (1946). Mathematical Methods of Statistics. Princeton: Princeton University Press.

  2. McHugh, Mary L. (2012). Interrater reliability: The kappa statistic. Biochemia Medica. 22 (3): 276–282.

Examples


library(OTrecod)  # provides error_group(), OT_joint() and tab_test

# Basic examples:
sample1 <- as.factor(sample(1:3, 50, replace = TRUE))
length(sample1)
sample2 <- as.factor(sample(1:2, 50, replace = TRUE))
length(sample2)
sample3 <- as.factor(sample(c("A", "B", "C", "D"), 50, replace = TRUE))
length(sample3)
sample4 <- as.factor(sample(c("A", "B", "C", "D", "E"), 50, replace = TRUE))
length(sample4)

# Grouping only consecutive levels of sample4 (ord = TRUE by default):
error_group(sample1, sample4)
# Grouping all possible levels of sample1, consecutive or not:
error_group(sample2, sample1, ord = FALSE)



### using a sample of the tab_test object (3 complete covariates)
### Y1 and Y2 are the same variable encoded in 2 different forms in DB 1 and 2:
### (4 levels for Y1 and 3 levels for Y2)

data(tab_test)
# Example with n1 = n2 = 70 and only X1 and X2 as covariates
tab_test2 <- tab_test[c(1:70, 5001:5070), 1:5]

### An example of JOINT model (Manhattan distance)
# Suppose we want to impute the missing parts of Y1 in DB2 only ...
try1J <- OT_joint(tab_test2,
  nominal = c(1, 4:5), ordinal = c(2, 3),
  dist.choice = "M", which.DB = "B"
)

# Error rates between Y2 and the predictions of Y1 in DB 2,
# obtained by grouping the levels of Y1:
error_group(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)
table(try1J$DATA2_OT$Z, try1J$DATA2_OT$OTpred)



[Package OTrecod version 0.1.2 Index]