R: Multidimensional scaling of probability densities

fmdsd {dad}

R Documentation

Multidimensional scaling of probability densities

Description

Applies multidimensional scaling (MDS) to probability densities in order to describe a data folder consisting of T groups of individuals on which p variables are observed. The MDS is carried out by applying cmdscale to the matrix of distances between the T densities. Returns an object of class fmdsd.

Usage

fmdsd(xf, group.name = "group", gaussiand = TRUE, distance = c("jeffreys", "hellinger",
    "wasserstein", "l2", "l2norm"), windowh=NULL, data.centered = FALSE,
    data.scaled = FALSE, common.variance = FALSE, add = TRUE, nb.factors = 3,
    nb.values = 10, sub.title = "", plot.eigen = TRUE, plot.score = FALSE, nscore = 1:3,
    filename = NULL)

Arguments

`xf`	object of class `"folder"` or data frame. If it is an object of class `"folder"`, its elements are data frames with `p` numeric columns; a non-numeric column raises an error. The `t^{th}` element (`t = 1, \ldots, T`) corresponds to the `t^{th}` group. If it is a data frame, the column whose name is given by the `group.name` argument is a factor giving the groups. All other columns must be numeric; otherwise, an error is raised.
`group.name`	string. If `xf` is an object of class `"folder"`, it is the name of the grouping variable in the returned results. The default is `group.name = "group"`. If `xf` is a data frame, it is the name of the column of `xf` containing the groups.
`gaussiand`	logical. If `TRUE` (default), the probability densities are assumed to be Gaussian. If `FALSE`, densities are estimated using the Gaussian kernel method.
`distance`	The distance or divergence used to compute the distance matrix between the densities. If `gaussiand = TRUE`, the densities are parametrically estimated and the distance can be: `"jeffreys"` (default) Jeffreys measure (symmetrised Kullback-Leibler divergence), `"hellinger"` the Hellinger (Matusita) distance, `"wasserstein"` the Wasserstein distance, `"l2"` the `L^2` distance, `"l2norm"` the densities are normed and the `L^2` distance between these normed densities is used. If `gaussiand = FALSE`, the densities are estimated by the Gaussian kernel method and the distance can be `"l2"` (default) or `"l2norm"`.
`windowh`	either a list of `T` bandwidths (one per density associated with a group), or a strictly positive number. If `windowh = NULL` (default), the bandwidths are automatically computed. Ignored when `distance` is `"hellinger"`, `"jeffreys"` or `"wasserstein"`, since the densities are then parametrically estimated (see Details).
`data.centered`	logical. If `TRUE` (default is `FALSE`), the data of each group are centered.
`data.scaled`	logical. If `TRUE` (default is `FALSE`), the data of each group are centered (even if `data.centered = FALSE`) and scaled.
`common.variance`	logical. If `TRUE`, a common covariance matrix (or correlation matrix if `data.scaled = TRUE`), computed on the whole data, is used. If `FALSE` (default), a covariance (or correlation) matrix per group is used.
`add`	logical indicating if an additive constant should be computed and added to the non diagonal dissimilarities such that the modified dissimilarities are Euclidean (default `TRUE`; see `add` argument of `cmdscale`).
`nb.factors`	numeric. Number of returned principal coordinates (default `nb.factors = 3`). Warning: The `plot.fmdsd` and `interpret.fmdsd` functions cannot take into account more than `nb.factors` principal factors.
`nb.values`	numeric. Number of returned eigenvalues (default `nb.values = 10`).
`sub.title`	string. Subtitle for the graphs (default `""`).
`plot.eigen`	logical. If `TRUE` (default), the barplot of the eigenvalues is plotted.
`plot.score`	logical. If `TRUE` (default is `FALSE`), the graphs of the new coordinates are plotted. A new graphic device is opened for each pair of coordinates defined by the `nscore` argument.
`nscore`	numeric vector. If `plot.score = TRUE`, the numbers of the principal coordinates which are plotted. By default it is equal to `nscore = 1:3`. Its components cannot be greater than `nb.factors`.
`filename`	string. Name of the file in which the results are saved. By default (`filename = NULL`) they are not saved.
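
The effect of the `add` argument can be seen with base R alone. The sketch below is not the package's code: it calls `cmdscale` directly on a small made-up dissimilarity matrix (the values and the group labels are purely illustrative). The matrix violates the triangle inequality, so it has no exact Euclidean representation until the additive constant is applied.

```r
# Illustrative dissimilarities between 4 hypothetical groups.
# d(A, C) = 5 exceeds d(A, B) + d(B, C) = 2, so they are not Euclidean.
d <- matrix(c(0, 1, 5, 2,
              1, 0, 1, 2,
              5, 1, 0, 2,
              2, 2, 2, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Without the additive constant: at least one eigenvalue is negative.
plain <- cmdscale(as.dist(d), k = 2, eig = TRUE, add = FALSE)

# With add = TRUE, a constant `ac` is added to the off-diagonal
# dissimilarities so that the modified dissimilarities are Euclidean.
fixed <- cmdscale(as.dist(d), k = 2, eig = TRUE, add = TRUE)

min(plain$eig)   # negative
fixed$ac         # the additive constant (positive here)
min(fixed$eig)   # zero up to rounding error
```

Negative eigenvalues make the percentages of inertia harder to read, which is one reason to keep the default `add = TRUE`.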

Details

In order to compute the distances/dissimilarities between the groups, the T probability densities f_t corresponding to the T groups of individuals are either parametrically estimated (gaussiand = TRUE) or estimated using the Gaussian kernel method (gaussiand = FALSE). In the latter case, the windowh argument provides the list of the bandwidths to be used. Notice that in the multivariate case (p>1), the bandwidths are positive-definite matrices.

If windowh is a numerical value h, the bandwidth matrix is of the form h S, where S is either the square root of the covariance matrix (p>1) or the standard deviation (p=1) of the estimated density.

If windowh = NULL (default), h in the above formula is computed using the bandwidth.parameter function.
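
As a rough illustration of the form h S, the base R sketch below builds a bandwidth matrix for one group of a built-in data set. It is not the package's implementation: the helper `sqrt_matrix` is hypothetical, and the value of `h` uses the usual normal reference rule as an assumption, which may or may not match what `bandwidth.parameter` returns.

```r
# Symmetric square root of a covariance matrix via its eigendecomposition
# (hypothetical helper, not part of dad).
sqrt_matrix <- function(v) {
  e <- eigen(v, symmetric = TRUE)
  e$vectors %*% diag(sqrt(e$values), nrow = length(e$values)) %*% t(e$vectors)
}

# One group of individuals with p = 2 numeric variables.
x <- as.matrix(iris[iris$Species == "setosa", c("Sepal.Length", "Sepal.Width")])
n <- nrow(x)
p <- ncol(x)

S <- sqrt_matrix(cov(x))

# Assumed smoothing parameter (normal reference rule); a user-supplied
# strictly positive number passed as windowh would play the same role.
h <- (4 / (n * (p + 2)))^(1 / (p + 4))

H <- h * S   # bandwidth matrix of the form h S
```

Because S is symmetric positive-definite and h is strictly positive, H is itself a positive-definite matrix, as required of a multivariate bandwidth.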

The distance or dissimilarity between the estimated densities is either the L^2 distance, the Hellinger distance, Jeffreys measure (symmetrised Kullback-Leibler divergence) or the Wasserstein distance.
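
For two univariate Gaussian densities, the L^2 distance has a simple closed form, which the following base R sketch compares with numerical integration. The function names are hypothetical, and the normed variant assumes that "normed" means each density is divided by its own L^2 norm; the package's exact conventions may differ.

```r
# Inner product of two univariate Gaussian densities N(m1, s1^2), N(m2, s2^2):
# the integral of f * g equals the N(0, s1^2 + s2^2) density at m1 - m2.
inner_gauss <- function(m1, s1, m2, s2) {
  dnorm(m1 - m2, mean = 0, sd = sqrt(s1^2 + s2^2))
}

# L2 distance: sqrt(<f, f> + <g, g> - 2 <f, g>).
l2_gauss <- function(m1, s1, m2, s2) {
  sq <- inner_gauss(m1, s1, m1, s1) + inner_gauss(m2, s2, m2, s2) -
    2 * inner_gauss(m1, s1, m2, s2)
  sqrt(pmax(0, sq))   # guard against tiny negative rounding errors
}

# L2 distance between the normed densities f / ||f|| and g / ||g||
# (assumed meaning of "normed").
l2norm_gauss <- function(m1, s1, m2, s2) {
  cosine <- inner_gauss(m1, s1, m2, s2) /
    sqrt(inner_gauss(m1, s1, m1, s1) * inner_gauss(m2, s2, m2, s2))
  sqrt(pmax(0, 2 - 2 * cosine))
}

# Numerical check of the closed form for N(0, 1) versus N(1, 4).
sq_diff <- function(x) (dnorm(x, 0, 1) - dnorm(x, 1, 2))^2
numeric_l2 <- sqrt(integrate(sq_diff, -Inf, Inf)$value)
closed_l2 <- l2_gauss(0, 1, 1, 2)
```

The two values `numeric_l2` and `closed_l2` should agree up to the accuracy of the numerical integration.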

If it is the L^2 distance (distance="l2" or distance="l2norm"), the densities can be either parametrically estimated or estimated using the Gaussian kernel.
If it is the Hellinger distance (distance="hellinger"), Jeffreys measure (distance="jeffreys") or the Wasserstein distance (distance="wasserstein"), the densities are considered Gaussian and necessarily parametrically estimated.
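
The parametric case can be sketched in base R for a single variable (p = 1): each group is summarised by its mean and standard deviation, a distance matrix between the resulting Gaussian densities is computed, and `cmdscale` is applied to it. The helper names are hypothetical and the formulas are the textbook closed forms; the package's own scaling conventions may differ, and `iris` merely stands in for a data folder.

```r
# Textbook closed forms for univariate Gaussian densities N(m, s^2).
# Hypothetical helpers, not part of dad.
wasserstein_gauss <- function(m1, s1, m2, s2) {
  sqrt((m1 - m2)^2 + (s1 - s2)^2)
}

hellinger_gauss <- function(m1, s1, m2, s2) {
  # Integral of sqrt(f * g) (Bhattacharyya coefficient).
  bc <- sqrt(2 * s1 * s2 / (s1^2 + s2^2)) *
    exp(-(m1 - m2)^2 / (4 * (s1^2 + s2^2)))
  sqrt(pmax(0, 2 - 2 * bc))   # L2 norm of sqrt(f) - sqrt(g)
}

jeffreys_gauss <- function(m1, s1, m2, s2) {
  # Sum of the two Kullback-Leibler divergences.
  d2 <- (m1 - m2)^2
  (s1^2 + d2) / (2 * s2^2) + (s2^2 + d2) / (2 * s1^2) - 1
}

# Parametric estimation: one mean and one standard deviation per group.
groups <- split(iris$Sepal.Length, iris$Species)
m <- sapply(groups, mean)
s <- sapply(groups, sd)

# T x T matrix of pairwise distances between the estimated densities.
pairwise <- function(dist_fun) {
  d <- outer(seq_along(m), seq_along(m),
             function(i, j) dist_fun(m[i], s[i], m[j], s[j]))
  dimnames(d) <- list(names(m), names(m))
  d
}

d_wass <- pairwise(wasserstein_gauss)

# Principal coordinates of the groups from the distance matrix.
coords <- cmdscale(as.dist(d_wass), k = 2)
```

Replacing `wasserstein_gauss` with `hellinger_gauss` or `jeffreys_gauss` in the call to `pairwise` gives the corresponding distance matrix.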

Value

Returns an object of class fmdsd, i.e. a list including:

`inertia`	data frame of the eigenvalues and percentages of inertia.
`scores`	data frame of the `nb.factors` first principal coordinates.
`means`	list of the means.
`variances`	list of the covariance matrices.
`correlations`	list of the correlation matrices.
`skewness`	list of the skewness coefficients.
`kurtosis`	list of the kurtosis coefficients.

Author(s)

Rachid Boumaza, Pierre Santagostini, Smail Yousfi, Gilles Hunault, Sabine Demotes-Mainard

References

Boumaza, R., Yousfi, S., Demotes-Mainard, S. (2015). Interpreting the principal component analysis of multivariate density functions. Communications in Statistics - Theory and Methods, 44 (16), 3321-3339.

Delicado, P. (2011). Dimensionality reduction when data are density functions. Computational Statistics & Data Analysis, 55, 401-420.

Yousfi, S., Boumaza, R., Aissani, D., Adjabi, S. (2014). Optimal bandwidth matrices in functional principal component analysis of density function. Journal of Statistical Computation and Simulation, 85 (11), 2315-2330.

Cox, T.F., Cox, M.A.A. (2001). Multidimensional Scaling, second ed. Chapman & Hall/CRC.

Examples

data(roses)
rosesf <- as.folder(roses[,c("Sha","Den","Sym","rose")])

# MDS on Gaussian densities (on sensory data)

# using jeffreys measure (default):
resultjeff <- fmdsd(rosesf, distance = "jeffreys")
print(resultjeff)
plot(resultjeff)

## Not run: 
# Applied to a data frame:
resultjeffdf <- fmdsd(roses[,c("Sha","Den","Sym","rose")],
                      distance = "jeffreys", group.name = "rose")
print(resultjeffdf)
plot(resultjeffdf)

## End(Not run)

# using the Hellinger distance:
resulthellin <- fmdsd(rosesf, distance = "hellinger")
print(resulthellin)
plot(resulthellin)

# using the Wasserstein distance:
resultwass <- fmdsd(rosesf, distance = "wasserstein")
print(resultwass)
plot(resultwass)

# Gaussian case, using the L2-distance:
resultl2 <- fmdsd(rosesf, distance = "l2")
print(resultl2)
plot(resultl2)

# Gaussian case, using the L2-distance between normed densities:
resultl2norm <- fmdsd(rosesf, distance = "l2norm")
print(resultl2norm)
plot(resultl2norm)

## Not run: 
# Non Gaussian case, using the L2-distance,
# the densities are estimated using the Gaussian kernel method:
result <- fmdsd(rosesf, distance = "l2", gaussiand = FALSE, group.name = "rose")
print(result)
plot(result)

## End(Not run)

[Package dad version 4.1.2 Index]