discdd.predict {dad} R Documentation

Predicting the class of a group of individuals with discriminant analysis of probability distributions.

Description

Assigns several groups of individuals, one group at a time, to one of K classes of groups. Each group is assigned to the class that minimizes the distance or divergence between the probability distribution associated with the group and the K probability distributions associated with the K classes.

Usage

discdd.predict(xf, class.var, distance =  c("l1", "l2", "chisqsym", "hellinger",
           "jeffreys", "jensen", "lp"), crit = 1, misclass.ratio = FALSE, p)

Arguments

xf

object of class folderh with two data frames, or list of arrays (or tables).

  • If it is a folderh:

    • The first data.frame has at least two columns. One column contains the names of the T groups (all the names must be different). Another column is a factor with K levels partitioning the T groups into K classes.

    • The second one has (q+1) columns. The first q columns are factors (otherwise, they are coerced into factors). The last column is a factor with T levels defining T groups. Each group, say t, consists of n_t individuals.

  • If it is a list of arrays or tables, the t-th element (t = 1, ..., T) is the table of the joint distribution (absolute or relative frequencies) of the t-th group. All these arrays (or tables) xf[[i]] have:

    • the same dimension(s). If q = 1 (univariate), dim(xf[[i]]) is an integer. If q > 1 (multivariate), dim(xf[[i]]) is an integer vector of length q.

    • the same dimension names dimnames(xf[[i]]), which must be non-NULL. These dimnames are the names of the variables.
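As a minimal sketch of the list-of-tables input described above, the following base R code builds one contingency table per group from hypothetical toy data. The variable names a and b, the group names g1 to g3 and the class labels are assumptions made for illustration only. Declaring the factor levels explicitly is what guarantees that every table has the same dimensions and dimnames.

```r
# Hypothetical toy data: 3 groups, q = 2 categorical variables
set.seed(1)
lev_a <- c("low", "high")
lev_b <- c("x", "y", "z")

# Draw n individuals; fixed factor levels keep all tables the same shape
make_group <- function(n) {
  data.frame(a = factor(sample(lev_a, n, replace = TRUE), levels = lev_a),
             b = factor(sample(lev_b, n, replace = TRUE), levels = lev_b))
}
groups <- list(g1 = make_group(20), g2 = make_group(30), g3 = make_group(25))

# One table of absolute frequencies per group
xf <- lapply(groups, table)

# Data frame with "group" and "class" columns (class labels are made up)
class.var <- data.frame(group = names(xf),
                        class = factor(c("A", "B", "A")))
```

Here dim(xf[[1]]) is an integer vector of length q = 2, and names(dimnames(xf[[1]])) gives the variable names.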

class.var

string (if xf is an object of class "folderh") or data.frame with at least two columns (if xf is a list of arrays or tables).

  • If xf is of class "folderh", class.var is the name of the class variable.

  • If xf is a list of arrays or a list of tables, class.var is a data.frame with at least two columns named "group" and "class". The "group" column contains the names of the T groups (all the names must be different). The "class" column is a factor with K levels partitioning the T groups into K classes.

distance

The distance or dissimilarity used to compute the distance matrix between the densities. It can be:

  • "l1" (default) the L^p distance with p = 1

  • "l2" the L^p distance with p = 2

  • "chisqsym" the symmetric Chi-squared distance

  • "hellinger" the Hellinger metric (Matusita distance)

  • "jeffreys" Jeffreys distance (symmetrised Kullback-Leibler divergence)

  • "jensen" the Jensen-Shannon distance

  • "lp" the L^p distance with p given by the argument p of the function.
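To give a feel for some of the measures listed above, here is a small sketch that evaluates them on two hypothetical discrete distributions p and q, using their usual textbook definitions. This is an assumption about the formulas, not a copy of the package code: the implementation in dad may use different normalizing constants or treat empty cells differently. The helper lp_dist is a name introduced here for illustration.

```r
# Two hypothetical discrete distributions over the same 3 cells
p <- c(0.5, 0.3, 0.2)
q <- c(0.2, 0.3, 0.5)

# L^r distance (illustrative helper, not part of dad)
lp_dist <- function(p, q, r) sum(abs(p - q)^r)^(1 / r)

d_l1 <- lp_dist(p, q, 1)                          # L^1 distance
d_l2 <- lp_dist(p, q, 2)                          # L^2 distance
d_hellinger <- sqrt(sum((sqrt(p) - sqrt(q))^2))   # Matusita form
d_jeffreys <- sum((p - q) * log(p / q))           # symmetrised Kullback-Leibler
```

Note that the Jeffreys expression is undefined when a cell has zero probability in one of the two distributions.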

crit

1 or 2. Selects how the densities associated with the classes are chosen. See Details.

misclass.ratio

logical (default FALSE). If TRUE, the confusion matrix and misclassification ratio are computed on the groups whose prior class is known. To compute the misclassification ratio by the leave-one-out method, use the discdd.misclass function.

p

integer. Optional. When distance = "lp" (L^p distance with p>2), p is the parameter of the distance.

Details

Value

Returns an object of class discdd.predict, that is a list including:

prediction

data frame with 3 columns:

  • factor giving the group name. The column name is the same as that of the column (q+1) of the second data frame of xf,

  • class.known: the prior class of the group if it is available, or NA if not,

  • class.predict: the class allocation predicted by the discriminant analysis method. If misclass.ratio = TRUE, the class allocations are computed for all groups. Otherwise (default), they are computed only for the groups whose class is unknown.

distances

matrix with T rows and K columns, of the distances (d_{tk}): d_{tk} is the distance between the group t and the class k, computed with the measure given by the argument distance,

proximities

matrix of the proximities (as percentages). The proximity of a group t to the class k is computed as: (1/d_{tk}) / \sum_{l=1}^{K} (1/d_{tl}).
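The relation between distances, proximities and predicted classes can be illustrated with a short worked example in base R. The distance matrix d below is made up (2 groups, 3 classes), and the factor 100 is an assumption based on the statement that proximities are expressed as percentages.

```r
# Hypothetical distances d[t, k] between 2 groups and 3 classes
d <- matrix(c(0.2, 0.5, 1.0,
              0.8, 0.4, 0.8),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("g1", "g2"), c("A", "B", "C")))

# Proximity: inverse distance, normalized so that each row sums to 100
inv <- 1 / d
prox <- 100 * inv / rowSums(inv)

# Each group is allocated to the class at minimum distance
pred <- colnames(d)[apply(d, 1, which.min)]
```

For group g1 the inverse distances are 5, 2 and 1, so its proximities are 62.5, 25 and 12.5, and it is allocated to class A. The class with the smallest distance is always the one with the largest proximity.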

confusion.mat

the confusion matrix (if misclass.ratio = TRUE)

misclassed

the misclassification ratio (if misclass.ratio = TRUE)
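As a rough illustration of these two outputs, the sketch below cross-tabulates hypothetical known and predicted classes and derives the misclassification ratio as the share of groups off the diagonal. The labels are made up, and the orientation of the table (known classes in rows, predicted classes in columns) is an assumption rather than a statement about how dad lays out confusion.mat.

```r
# Hypothetical known and predicted classes for 5 groups
known <- factor(c("A", "A", "B", "B", "B"), levels = c("A", "B"))
predicted <- factor(c("A", "B", "B", "B", "A"), levels = c("A", "B"))

# Confusion matrix: rows = known class, columns = predicted class
conf <- table(known, predicted)

# Misclassification ratio: proportion of groups not on the diagonal
ratio <- 1 - sum(diag(conf)) / sum(conf)
```

With these labels, 3 of the 5 groups are correctly classified, so the ratio is 0.4.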

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

References

Rudrauf, J.M., Boumaza, R. (2001). Contribution à l'étude de l'architecture médiévale: les caractéristiques des pierres à bossage des châteaux forts alsaciens, Centre de Recherches Archéologiques médiévales de Saverne, 5, 5-38.

Examples

library(dad)
data(castles.dated)
data(castles.nondated)
stones <- rbind(castles.dated$stones, castles.nondated$stones)
periods <- rbind(castles.dated$periods, castles.nondated$periods)
stones$height <- cut(stones$height, breaks = c(19, 27, 40, 71), include.lowest = TRUE)
stones$width <- cut(stones$width, breaks = c(24, 45, 62, 144), include.lowest = TRUE)
stones$edging <- cut(stones$edging, breaks = c(0, 3, 4, 8), include.lowest = TRUE)
stones$boss <- cut(stones$boss, breaks = c(0, 6, 9, 20), include.lowest = TRUE)

castlesfh <- folderh(periods, "castle", stones)

# Default: distance = "l1", crit = 1
discdd.predict(castlesfh, "period")

# With the calculation of the confusion matrix and misclassification ratio
discdd.predict(castlesfh, "period", misclass.ratio = TRUE)

# Hellinger distance
discdd.predict(castlesfh, "period", distance = "hellinger")

# crit=2
discdd.predict(castlesfh, "period", crit = 2)

[Package dad version 4.1.2 Index]