R: Misclassification ratio in functional discriminant analysis...

discdd.misclass {dad}

R Documentation

Misclassification ratio in functional discriminant analysis of discrete probability distributions.

Description

Computes the leave-one-out misclassification ratio of the rule that assigns each of T groups of individuals, one group after another, to one of K classes of groups. A group is assigned to the class that minimizes the distance or divergence between the probability distribution associated with the group and the K probability distributions associated with the K classes.
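
The assignment rule described above can be illustrated with a minimal sketch in base R. The toy distributions and the object names (`g`, `f`, `d`, `alloc`) are hypothetical and are not part of the package; the L1 distance is used only because it is the documented default.

```r
# Hypothetical toy data: K = 2 class distributions over 3 categories
# (each row sums to 1)
g <- rbind(A = c(0.6, 0.3, 0.1),
           B = c(0.1, 0.3, 0.6))

# Distribution of the single group to assign
f <- c(0.5, 0.4, 0.1)

# L1 distance between the group and each class
d <- apply(g, 1, function(gk) sum(abs(f - gk)))

# The group is assigned to the class with the smallest distance
alloc <- names(which.min(d))

print(d)
print(alloc)
```

Repeating this for every group, and comparing each allocation with the known class, gives the misclassification ratio.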

Usage

discdd.misclass(xf, class.var, distance =  c("l1", "l2", "chisqsym", "hellinger",
           "jeffreys", "jensen", "lp"), crit = 1, p)

Arguments

`xf`	object of class `folderh` with two data frames, or list of arrays (or tables). If it is a `folderh`: the first data frame has at least two columns. One column contains the names of the `T` groups (all the names must be different). Another column is a factor with `K` levels partitioning the `T` groups into `K` classes. The second data frame has `(q+1)` columns. The first `q` columns are factors (otherwise, they are coerced into factors). The last column is a factor with `T` levels defining the `T` groups. Each group, say `t`, consists of `n_t` individuals. If it is a list of arrays or tables: the `t^{th}` element (`t = 1, \ldots, T`) is the table of the joint distribution (absolute or relative frequencies) of the `t^{th}` group. All arrays (or tables) `xf[[i]]` must have the same shape, that is, the same dimension(s) and the same non-`NULL` dimension names `dimnames(xf[[i]])`. If `q = 1` (univariate), `dim(xf[[i]])` is a single integer. If `q > 1` (multivariate), `dim(xf[[i]])` is an integer vector of length `q`. These dimnames are the names of the variables.
`class.var`	string (if `xf` is an object of class `"folderh"`) or data frame with at least two columns (if `xf` is a list of arrays). If `xf` is of class `"folderh"`, `class.var` is the name of the class variable. If `xf` is a list of arrays or a list of tables, `class.var` is a data frame with at least two columns named `"group"` and `"class"`. The `"group"` column contains the names of the `T` groups (all the names must be different). The `"class"` column is a factor with `K` levels partitioning the `T` groups into `K` classes.
`distance`	The distance or dissimilarity used to compute the distance matrix between the distributions. It can be:
- `"l1"` (default) the `L^p` distance with `p = 1`
- `"l2"` the `L^p` distance with `p = 2`
- `"chisqsym"` the symmetric Chi-squared distance
- `"hellinger"` the Hellinger metric (Matusita distance)
- `"jeffreys"` Jeffreys distance (symmetrised Kullback-Leibler divergence)
- `"jensen"` the Jensen-Shannon distance
- `"lp"` the `L^p` distance with `p` given by the argument `p` of the function.
`crit`	1 or 2. Selects the method used to estimate the distributions associated with the classes. See Details.
`p`	integer. Optional. When `distance = "lp"` (`L^p` distance with `p>2`), `p` is the parameter of the distance.
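
As a rough illustration of some of the `distance` options, the sketch below writes common textbook forms of the `L^p`, Hellinger and Jeffreys measures for two discrete distributions on the same support. The helper names (`lp_dist`, `hellinger_dist`, `jeffreys_div`) are hypothetical and are not functions of the package, and the package may use different normalizing constants.

```r
# Two hypothetical discrete distributions on the same support
# (strictly positive, so that the logarithm below is finite)
p <- c(0.5, 0.3, 0.2)
q <- c(0.2, 0.3, 0.5)

# L^p distance; the exponent is called r to avoid clashing with p
lp_dist <- function(p, q, r = 1) sum(abs(p - q)^r)^(1 / r)

# Hellinger (Matusita) distance, without the 1/sqrt(2) factor
# that some authors include
hellinger_dist <- function(p, q) sqrt(sum((sqrt(p) - sqrt(q))^2))

# Jeffreys divergence: symmetrised Kullback-Leibler, KL(p, q) + KL(q, p)
jeffreys_div <- function(p, q) sum((p - q) * log(p / q))

print(lp_dist(p, q, r = 1))
print(lp_dist(p, q, r = 2))
print(hellinger_dist(p, q))
print(jeffreys_div(p, q))
```

All three measures are symmetric in their arguments and equal zero when the two distributions coincide.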

Details

If xf is an object of class "folderh" containing the data:

The T probability distributions f_t corresponding to the T groups of individuals are estimated by frequency distributions within each group.

The class k, consisting of T_k groups, is associated with the probability distribution g_k. With the leave-one-out method, the group being assigned is excluded from its class k when g_k is estimated. The crit argument selects the estimation method of the g_k's.
- crit=1 The probability distribution g_k is estimated using all the data of this class, that is the rows of xf corresponding to the T_k groups of the class k.
  
  The estimation of the g_k's uses the same method as the estimation of the f_t's.
- crit=2 The T_k probability distributions f_t are estimated using the corresponding data from xf. They are then averaged to obtain an estimate of g_k, that is g_k = \frac{1}{T_k} \, \sum{f_t}.

If xf is a list of arrays (or list of tables):

The t^{th} array is the joint frequency distribution of the t^{th} group. The frequencies can be absolute or relative.

The class k, consisting of T_k groups, is associated with the probability distribution g_k. With the leave-one-out method, the group being assigned is excluded from its class k when g_k is estimated. The crit argument selects the estimation method of the g_k's.
- crit=1 g_k = \frac{1}{\sum n_t} \sum n_t f_t, where n_t is the total of xf[[t]].
  
  Notice that when xf[[t]] contains relative frequencies, its total is 1, so crit=1 is then equivalent to crit=2.
- crit=2 g_k = \frac{1}{T_k} \, \sum f_t.
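
The two estimation methods can be compared with a minimal base R sketch, assuming a hypothetical class of three univariate groups given as absolute frequency tables. The names `xf_k` and `estimate_gk` are illustrative only and are not part of the package.

```r
# Hypothetical class of three groups, given as absolute frequencies
xf_k <- list(
  g1 = c(a = 8,  b = 2),    # n_1 = 10
  g2 = c(a = 10, b = 30),   # n_2 = 40
  g3 = c(a = 25, b = 25)    # n_3 = 50
)

estimate_gk <- function(tabs, crit = 1) {
  n <- sapply(tabs, sum)                     # group totals n_t
  f <- lapply(tabs, function(x) x / sum(x))  # relative frequencies f_t
  if (crit == 1) {
    # Weighted mean: (1 / sum n_t) * sum n_t f_t, i.e. pooled counts
    Reduce(`+`, Map(`*`, n, f)) / sum(n)
  } else {
    # Unweighted mean: (1 / T_k) * sum f_t
    Reduce(`+`, f) / length(f)
  }
}

# Leave-one-out: when assigning g1, g_k is estimated from g2 and g3 only
g_crit1 <- estimate_gk(xf_k[-1], crit = 1)
g_crit2 <- estimate_gk(xf_k[-1], crit = 2)

print(g_crit1)
print(g_crit2)
```

With absolute frequencies the larger groups weigh more under crit=1, whereas crit=2 gives every group the same weight regardless of its size.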

Value

Returns an object of class discdd.misclass, that is a list including:

`classification`	data frame with 4 columns:
- factor giving the group name. The column name is the same as that of the column (`q+1`) of `xf`,
- the prior class of the group if it is available, or `NA` if not,
- `alloc`: the class allocation computed by the discriminant analysis method,
- `misclassed`: logical. `TRUE` if the group is misclassified, `FALSE` if it is correctly classified, `NA` if the prior class of the group is unknown.
`confusion.mat`	confusion matrix,
`misalloc.per.class`	the misclassification ratio per class,
`misclassed`	the misclassification ratio,
`distances`	matrix with `T` rows and `K` columns, of the distances (`d_{tk}`): `d_{tk}` is the distance between the group `t` and the class `k`,
`proximities`	matrix of the proximity indices (in percent) between the groups and the classes. The proximity between the group `t` and the class `k` is: `(1/d_{tk})/\sum_{l=1}^{K}(1/d_{tl})`.
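
The proximity formula can be sketched in base R from a hypothetical distance matrix. The objects `d`, `prox` and `alloc` below are illustrative and are not produced by the package; the sketch returns proportions, which would be multiplied by 100 to express them in percent.

```r
# Hypothetical distance matrix: T = 3 groups (rows), K = 2 classes (columns)
d <- matrix(c(0.2, 0.8,
              0.5, 0.5,
              0.9, 0.1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), c("A", "B")))

# Proximity of group t to class k: (1 / d_tk) / sum_l (1 / d_tl)
inv <- 1 / d
prox <- inv / rowSums(inv)   # each row is divided by its own sum

# Allocation: the class with the smallest distance (largest proximity);
# which.min returns the first class in case of a tie
alloc <- colnames(d)[apply(d, 1, which.min)]

print(prox)
print(alloc)
```

Each row of `prox` sums to 1. A distance of exactly zero would make this sketch return `NaN` and would need separate handling.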

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

References

Rudrauf, J.M., Boumaza, R. (2001). Contribution à l'étude de l'architecture médiévale: les caractéristiques des pierres à bossage des châteaux forts alsaciens, Centre de Recherches Archéologiques médiévales de Saverne, 5, 5-38.

Examples

# Example 1 with a folderh obtained by converting numeric variables
data("castles.dated")
stones <- castles.dated$stones
periods <- castles.dated$periods
stones$height <- cut(stones$height, breaks = c(19, 27, 40, 71), include.lowest = TRUE)
stones$width <- cut(stones$width, breaks = c(24, 45, 62, 144), include.lowest = TRUE)
stones$edging <- cut(stones$edging, breaks = c(0, 3, 4, 8), include.lowest = TRUE)
stones$boss <- cut(stones$boss, breaks = c(0, 6, 9, 20), include.lowest = TRUE)

castlefh <- folderh(periods, "castle", stones)

# Default: distance = "l1", crit = 1
discdd.misclass(castlefh, "period")

# Hellinger distance, crit=2
discdd.misclass(castlefh, "period", distance = "hellinger", crit = 2)


# Example 2 with a list of 96 arrays
data("dspgd2015")
data("departments")
classes <- departments[, c("coded", "namer")]
names(classes) <- c("group", "class")

# Default: distance = "l1", crit = 1
discdd.misclass(dspgd2015, classes)

# Hellinger distance, crit=2
discdd.misclass(dspgd2015, classes, distance = "hellinger", crit = 2)

[Package dad version 4.1.2 Index]