R: Predicting the class of a group of individuals with...

discdd.predict {dad}

R Documentation

Predicting the class of a group of individuals with discriminant analysis of probability distributions.

Description

Assigns several groups of individuals, one group after another, to one of K classes of groups. Each group is assigned to the class that minimizes the distance or divergence between the probability distribution associated with the group and the K probability distributions associated with the K classes.

Usage

discdd.predict(xf, class.var, distance =  c("l1", "l2", "chisqsym", "hellinger",
           "jeffreys", "jensen", "lp"), crit = 1, misclass.ratio = FALSE, p)

Arguments

`xf`	object of class `folderh` with two data frames, or a list of arrays (or tables). If it is a `folderh`: the first data frame has at least two columns. One column contains the names of the `T` groups (all the names must be different). Another column is a factor with `K` levels partitioning the `T` groups into `K` classes. The second data frame has `(q+1)` columns. The first `q` columns are factors (otherwise, they are coerced into factors). The last column is a factor with `T` levels defining the `T` groups. Each group, say `t`, consists of `n_t` individuals. If it is a list of arrays or tables, the `t^{th}` element (`t = 1, \ldots, T`) is the table of the joint distribution (absolute or relative frequencies) of the `t^{th}` group. All arrays (or tables) `xf[[i]]` must have the same shape, that is: (1) the same dimension(s): if `q = 1` (univariate), `dim(xf[[i]])` is an integer; if `q > 1` (multivariate), `dim(xf[[i]])` is an integer vector of length `q`; (2) the same dimension names `dimnames(xf[[i]])`, which must not be `NULL`. These dimnames are the names of the variables.
`class.var`	string (if `xf` is an object of class `"folderh"`) or data frame (if `xf` is a list of arrays or tables). If `xf` is of class `"folderh"`, `class.var` is the name of the class variable. If `xf` is a list of arrays or a list of tables, `class.var` is a data frame with at least two columns named `"group"` and `"class"`. The `"group"` column contains the names of the `T` groups (all the names must be different). The `"class"` column is a factor with `K` levels partitioning the `T` groups into `K` classes.
`distance`	The distance or dissimilarity used to compute the distance matrix between the densities. It can be: `"l1"` (default), the `L^p` distance with `p = 1`; `"l2"`, the `L^p` distance with `p = 2`; `"chisqsym"`, the symmetric Chi-squared distance; `"hellinger"`, the Hellinger metric (Matusita distance); `"jeffreys"`, the Jeffreys distance (symmetrised Kullback-Leibler divergence); `"jensen"`, the Jensen-Shannon distance; `"lp"`, the `L^p` distance with `p` given by the argument `p` of the function.
`crit`	1 (default) or 2. Selects the method used to estimate the densities associated with the classes. See Details.
`misclass.ratio`	logical (default `FALSE`). If `TRUE`, the confusion matrix and misclassification ratio are computed on the groups whose prior class is known. To compute the misclassification ratio by the leave-one-out method, use the `discdd.misclass` function.
`p`	integer. Optional. When `distance = "lp"` (`L^p` distance with `p>2`), `p` is the parameter of the distance.
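
As an illustration of what the `distance` argument measures, the following base R sketch computes textbook versions of the `L^p` and Hellinger (Matusita) distances between two discrete distributions. The helper names `lp_dist` and `hellinger_dist` and the numeric values are hypothetical; the package's own implementation may differ, for instance by a normalizing constant.

```r
# Two relative frequency distributions over the same three categories
# (illustrative values, not taken from the package data).
f <- c(a = 0.5, b = 0.3, c = 0.2)
g <- c(a = 0.2, b = 0.3, c = 0.5)

# L^p distance: (sum |f - g|^p)^(1/p)
lp_dist <- function(f, g, p = 1) sum(abs(f - g)^p)^(1 / p)

# Matusita form of the Hellinger distance: sqrt(sum (sqrt(f) - sqrt(g))^2)
# (assumed textbook definition, not checked against the package source)
hellinger_dist <- function(f, g) sqrt(sum((sqrt(f) - sqrt(g))^2))

d_l1 <- lp_dist(f, g, p = 1)   # |0.3| + |0| + |0.3|
d_l2 <- lp_dist(f, g, p = 2)   # sqrt(0.09 + 0 + 0.09)
d_h  <- hellinger_dist(f, g)
```

All three quantities are zero when the two distributions coincide and grow as they diverge, which is why the class at minimum distance is selected.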

Details

If xf is an object of class "folderh" containing the data:

The T probability distributions f_t corresponding to the T groups of individuals are estimated by frequency distributions within each group.

To the class k consisting of T_k groups is associated the probability distribution g_k. The crit argument selects the estimation method of the g_k's.
- crit=1 The probability distribution g_k is estimated using the whole data of this class, that is the rows of the second data frame of xf corresponding to the T_k groups of the class k.
  
  The estimation of the g_k's uses the same method as the estimation of the f_t's.
- crit=2 The T_k probability distributions f_t are estimated using the corresponding data from xf. They are then averaged to obtain an estimate of the density g_k, that is g_k = \frac{1}{T_k} \, \sum f_t.
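
The difference between the two criteria can be sketched in base R on individual-level data. The data frame `x`, its column names and its values are hypothetical, and the code only mimics the estimation described above; it is not the package's internal code.

```r
# Hypothetical individual-level data: one factor variable, three groups,
# all belonging to the same class k.
x <- data.frame(
  colour = factor(c("red", "red", "blue", "blue", "blue", "red", "blue", "blue"),
                  levels = c("blue", "red")),
  group  = factor(c("g1", "g1", "g1", "g1", "g2", "g2", "g3", "g3"))
)

# f_t: relative frequency distribution within each group (one row per group)
f <- prop.table(table(x$group, x$colour), margin = 1)

# crit = 1: pool all individuals of the class, then estimate once
g_crit1 <- prop.table(table(x$colour))

# crit = 2: average the per-group distributions, each group weighted equally
g_crit2 <- colMeans(f)
```

With groups of unequal size the two estimates differ: pooling gives large groups more influence, whereas averaging treats every group alike.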
If xf is a list of arrays (or list of tables):

The t^{th} array is the joint frequency distribution of the t^{th} group. The frequencies can be absolute or relative.

To the class k consisting of T_k groups is associated the probability distribution g_k. The crit argument selects the estimation method of the g_k's.
- crit=1 g_k = \frac{1}{\sum n_t} \sum n_t f_t, where n_t is the total of xf[[t]].
  
  Notice that when every xf[[t]] contains relative frequencies, each total n_t is 1, so crit=1 is then equivalent to crit=2.
- crit=2 g_k = \frac{1}{T_k} \, \sum f_t.
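
A minimal sketch of these two formulas for a list of tables, in base R. The helper `class_dist` and the example tables are hypothetical, written only to illustrate the weighting; they are not part of the package.

```r
# Hypothetical absolute-frequency tables for two groups of the same class
xf <- list(
  g1 = as.table(c(low = 8, high = 2)),   # n_1 = 10
  g2 = as.table(c(low = 1, high = 1))    # n_2 = 2
)

class_dist <- function(tabs, crit = 1) {
  n <- vapply(tabs, sum, numeric(1))            # totals n_t
  f <- lapply(tabs, function(tb) tb / sum(tb))  # relative frequencies f_t
  if (crit == 1) {
    Reduce(`+`, Map(`*`, n, f)) / sum(n)        # weights proportional to n_t
  } else {
    Reduce(`+`, f) / length(f)                  # equal weights 1 / T_k
  }
}

g1 <- class_dist(xf, crit = 1)
g2 <- class_dist(xf, crit = 2)

# With relative frequencies every total is 1, so both criteria coincide
rel <- lapply(xf, prop.table)
same <- isTRUE(all.equal(class_dist(rel, 1), class_dist(rel, 2)))
```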

Value

Returns an object of class discdd.predict, that is a list including:

`prediction`	data frame with 3 columns: a factor giving the group name (its column name is the same as that of column `(q+1)` of the second data frame of `xf`); `class.known`: the prior class of the group if it is available, or `NA` if not; `class.predict`: the class allocation predicted by the discriminant analysis method. If `misclass.ratio = TRUE`, the class allocations are computed for all groups. Otherwise (default), they are computed only for the groups whose class is unknown.
`distances`	matrix with `T` rows and `K` columns of the distances `d_{tk}`: `d_{tk}` is the distance between the group `t` and the class `k`, computed with the measure given by the `distance` argument.
`proximities`	matrix of the proximities (in percent). The proximity of a group `t` to the class `k` is computed as `(1/d_{tk})/\sum_{l=1}^{K}(1/d_{tl})`.
`confusion.mat`	the confusion matrix (if `misclass.ratio = TRUE`)
`misclassed`	the misclassification ratio (if `misclass.ratio = TRUE`)
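
The relation between `distances`, `proximities` and the predicted class can be sketched in base R as follows. The distance matrix is made up for illustration, and the factor 100 is an assumption based on the proximities being reported in percent; how the package handles ties or zero distances is not covered here.

```r
# Hypothetical distance matrix: T = 3 groups (rows), K = 2 classes (columns)
d <- matrix(c(0.2, 0.8,
              0.4, 0.6,
              0.9, 0.1),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("g1", "g2", "g3"), c("A", "B")))

# Proximity of group t to class k: (1/d_tk) / sum_l (1/d_tl), in percent
inv <- 1 / d
proximities <- 100 * inv / rowSums(inv)

# Each group is assigned to the class at minimum distance
class_predict <- colnames(d)[apply(d, 1, which.min)]
```

Each row of `proximities` sums to 100, and the class with the highest proximity is the one at minimum distance.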

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

References

Rudrauf, J.M., Boumaza, R. (2001). Contribution à l'étude de l'architecture médiévale: les caractéristiques des pierres à bossage des châteaux forts alsaciens, Centre de Recherches Archéologiques médiévales de Saverne, 5, 5-38.

Examples

data(castles.dated)
data(castles.nondated)
stones <- rbind(castles.dated$stones, castles.nondated$stones)
periods <- rbind(castles.dated$periods, castles.nondated$periods)
stones$height <- cut(stones$height, breaks = c(19, 27, 40, 71), include.lowest = TRUE)
stones$width <- cut(stones$width, breaks = c(24, 45, 62, 144), include.lowest = TRUE)
stones$edging <- cut(stones$edging, breaks = c(0, 3, 4, 8), include.lowest = TRUE)
stones$boss <- cut(stones$boss, breaks = c(0, 6, 9, 20), include.lowest = TRUE )

castlesfh <- folderh(periods, "castle", stones)

# Default: distance = "l1", crit = 1
discdd.predict(castlesfh, "period")

# With the calculation of the confusion matrix and misclassification ratio
discdd.predict(castlesfh, "period", misclass.ratio = TRUE)

# Hellinger distance
discdd.predict(castlesfh, "period", distance = "hellinger")

# crit=2
discdd.predict(castlesfh, "period", crit = 2)

[Package dad version 4.1.2 Index]