R: Matthews Correlation Coefficient (MCC)

ClusterMCC {FCPS}

R Documentation

Matthews Correlation Coefficient (MCC)

Description

Matthews correlation coefficient generalized to the multiclass case (a.k.a. R_K statistic).

Usage

ClusterMCC(PriorCls, CurrentCls, Force = TRUE)

Arguments

`PriorCls`	Ground truth, [1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the labels of the classification.
`CurrentCls`	Main output of the clustering, [1:n] numerical vector with n numbers defining the classification. It has k unique numbers representing the labels of the clustering.
`Force`	Boolean. If TRUE, the computation is forced even if one or more of the k numbers given in `PriorCls` are missing from `CurrentCls`, or vice versa. In this case, one label per missing number is added at the end of the vectors.

Details

In contrast to accuracy, the MCC is a balanced measure that can be used even if the classes are of very different sizes. When there are more than two labels, the MCC no longer ranges between -1 and +1: the minimum value lies between -1 and 0, depending on the true distribution, while the maximum value is always +1.

Beware that, in contrast to ClusterAccuracy, the labels cannot be arbitrary. A given label has to denote the same cluster of data points in both PriorCls and CurrentCls. Typically this has to be ensured manually.
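The multiclass MCC (R_K statistic) can be sketched in base R directly from the confusion matrix. This is a minimal illustration under the standard R_K definition, not the FCPS implementation; the helper name `mcc_sketch` is hypothetical. It also shows why labels matter: swapping two labels in an otherwise perfect clustering lowers the score.

```r
# Hypothetical helper, not part of FCPS: multiclass MCC (R_K) from two label vectors.
mcc_sketch <- function(prior, current) {
  # Use a common label set so the confusion matrix is square
  lev <- sort(union(prior, current))
  cm <- table(factor(prior, levels = lev), factor(current, levels = lev))
  cm <- matrix(as.numeric(cm), nrow = length(lev))  # doubles avoid integer overflow

  s    <- sum(cm)        # total number of data points
  c_ok <- sum(diag(cm))  # points whose labels agree
  t_k  <- rowSums(cm)    # how often each label occurs in the ground truth
  p_k  <- colSums(cm)    # how often each label occurs in the clustering

  num <- c_ok * s - sum(t_k * p_k)
  den <- sqrt((s^2 - sum(p_k^2)) * (s^2 - sum(t_k^2)))
  if (den == 0) return(NA_real_)  # undefined, e.g. a single label on either side
  num / den
}

truth   <- c(1, 1, 1, 2, 2, 3)
swapped <- c(2, 2, 2, 1, 1, 3)  # same partition, labels 1 and 2 exchanged

mcc_sketch(truth, truth)    # identical labels: maximum value
mcc_sketch(truth, swapped)  # same partition, but the score drops below zero
```

The second call scores poorly although the partition is identical, which is why the labels have to be matched before the MCC is computed.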

Value

A single scalar, the MCC, in the range described in Details.

Note

If the numbers of clusters in PriorCls and CurrentCls differ, they are aligned internally, with zero data points belonging to the missing clusters.
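The idea of a missing cluster holding zero data points can be illustrated in base R. This is only a sketch of the concept, assuming a shared label set built with `union`; it does not reproduce the internal FCPS code, and the variable names are illustrative.

```r
prior   <- c(1, 1, 2, 2, 3, 3)
current <- c(1, 1, 2, 2, 2, 2)  # label 3 never occurs in the clustering

# Shared label set, so both vectors are tabulated over the same k labels
lev <- sort(union(prior, current))
cm <- table(Prior   = factor(prior,   levels = lev),
            Current = factor(current, levels = lev))

cm  # the column for the missing label 3 contains only zeros
```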

Author(s)

Michael Thrun

References

Matthews, B. W.: Comparison of the predicted and observed secondary structure of T4 phage lysozyme, Biochimica et Biophysica Acta (BBA), Protein Structure, Vol. 405(2), pp. 442-451, 1975.

Boughorbel, S.; Jarray, F. and El-Anbari, M.: Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric, PLOS ONE, Vol. 12(6), pp. e0177678, 2017.

Chicco, D.; Toetsch, N. and Jurman, G.: The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation, BioData Mining, Vol. 14, 2021.

Examples

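# Beware: the clustering algorithm assigns its labels arbitrarily
library(FCPS)
data(Hepta)
V = kmeansClustering(Hepta$Data, 7, Type = "Hartigan")
# Cross-tabulate to check whether cluster labels match the ground-truth labels
table(V$Cls, Hepta$Cls)
# The result is only valid once any label mismatch has been resolved manually
ClusterMCC(Hepta$Cls, V$Cls)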
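One possible way to resolve the labels manually is to rename each cluster to the ground-truth label it overlaps most. The sketch below uses only base R; `relabel_by_majority` is a hypothetical helper, not an FCPS function. It assumes numeric labels and that each cluster has a clear majority class, so the mapping is not guaranteed to be one-to-one.

```r
# Hypothetical helper, not part of FCPS: majority-vote relabeling of clusters.
relabel_by_majority <- function(prior, current) {
  tab <- table(current, prior)  # rows: cluster labels, columns: true labels
  # For each cluster, pick the true label with the largest overlap
  best <- colnames(tab)[apply(tab, 1, which.max)]
  names(best) <- rownames(tab)
  # Translate every cluster label and convert back to numeric labels
  as.numeric(unname(best[as.character(current)]))
}

prior   <- c(1, 1, 1, 2, 2, 3)
current <- c(2, 2, 2, 1, 1, 3)  # same partition, labels 1 and 2 exchanged

relabel_by_majority(prior, current)  # labels now agree with prior
```

The relabeled vector could then be passed as `CurrentCls`, after checking the cross-tabulation to confirm that no two clusters were mapped to the same label.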

[Package FCPS version 1.3.4 Index]