sim_eval {resemble} | R Documentation |
A function for evaluating dissimilarity matrices (sim_eval)
Description
This function searches for the most similar observation (closest neighbor) of each observation in a given dataset based on a dissimilarity matrix (e.g. a distance matrix). Each observation is then compared against its closest neighbor in terms of the side information provided. For continuous variables, the root mean square of differences and the correlation coefficient are used; for discrete variables, the kappa index is used.
Usage
sim_eval(d, side_info)
Arguments
d |
a symmetric matrix of dissimilarity scores between observations of a given dataset. Alternatively, a vector with the dissimilarity scores of the lower triangle (without the diagonal values) can be used (see details). |
side_info |
a matrix containing the side information corresponding to the observations in the dataset from which the dissimilarity matrix was computed. It can be either a numeric matrix with one or multiple columns/variables or a matrix with one character variable (discrete variable). If it is numeric, the root mean square of differences is used for assessing the similarity between the observations and their corresponding most similar observations in terms of the side information provided. If it is a character variable, then the kappa index is used. See details. |
Details
For the evaluation of dissimilarity matrices this function uses side
information (information about one variable which is available for a
group of observations, Ramirez-Lopez et al., 2013). It is assumed that there
is a (direct or indirect) correlation between this side informative variable
and the variables from which the dissimilarity was computed.
If side_info is numeric, the root mean square of differences (RMSD)
is used for assessing the similarity between the observations and their
corresponding most similar observations in terms of the side information
provided. It is computed as follows:
\[j(i) = NN(xr_i, Xr^{-i})\]
\[RMSD = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (y_i - y_{j(i)})^2}\]
where \(NN(xr_i, Xr^{-i})\) represents a function to obtain the index of the nearest neighbor observation found in \(Xr\) (excluding the \(i\)th observation) for \(xr_i\), \(y_{i}\) is the value of the side variable of the \(i\)th observation, \(y_{j(i)}\) is the value of the side variable of the nearest neighbor of the \(i\)th observation and \(m\) is the total number of observations.
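The nearest-neighbor search and the RMSD described above can be sketched in base R. This is only an illustration under stated assumptions, not the internal code of sim_eval: the toy data, the Euclidean distance from stats::dist() and the names x, y, nn_index and y_nn are all hypothetical.

```r
# Hypothetical toy data: 6 observations, one predictor, one side variable
x <- matrix(c(0, 0.1, 1, 1.2, 3, 3.3), ncol = 1)
y <- c(1.0, 1.2, 3.1, 2.9, 5.0, 5.3)

# Symmetric dissimilarity matrix (Euclidean distance used as a stand-in)
d <- as.matrix(dist(x))

# Exclude each observation from its own neighbor search
diag(d) <- Inf

# j(i): index of the nearest neighbor of the i-th observation
nn_index <- apply(d, 1, which.min)

# Side information of the nearest neighbors
y_nn <- y[nn_index]

# RMSD and correlation between y_i and y_j(i)
rmsd <- sqrt(mean((y - y_nn)^2))
r <- cor(y, y_nn)
```

A lower RMSD (and a higher correlation) suggests that observations that are close in the dissimilarity matrix also tend to have similar side information.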
If side_info is a discrete (character) variable, the kappa index
(\(\kappa\)) is used instead of the RMSD. It is computed as follows:
\[\kappa = \frac{p_o - p_e}{1 - p_e}\]
where \(p_o\) and \(p_e\) are two different agreement indices between the side information of the observations and the side information of their corresponding nearest observations (i.e. most similar observations). While \(p_o\) is the relative (observed) agreement, \(p_e\) is the agreement expected by chance.
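As a rough illustration of the kappa index, the sketch below computes \(p_o\), \(p_e\) and \(\kappa\) in base R. It assumes that the chance agreement \(p_e\) is obtained from the product of the marginal class proportions (Cohen's formulation); the labels and the names y, y_nn and kappa_index are hypothetical and are not taken from the package.

```r
# Hypothetical class labels of the observations and of their nearest neighbors
y    <- c("a", "a", "b", "b", "c", "c", "a", "b")
y_nn <- c("a", "b", "b", "b", "c", "a", "a", "b")

# Cross-tabulate both label vectors over a common set of classes
classes <- sort(unique(c(y, y_nn)))
tab <- table(factor(y, levels = classes), factor(y_nn, levels = classes))

# p_o: observed (relative) agreement
p_o <- sum(diag(tab)) / sum(tab)

# p_e: agreement expected by chance (assumed: product of the marginals)
p_e <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2

kappa_index <- (p_o - p_e) / (1 - p_e)
```

Values of kappa_index close to 1 indicate that nearest neighbors mostly share the same class, while values near 0 indicate agreement no better than chance.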
This function also accepts a vector for argument d; in this case, the
vector must represent the lower triangle of a dissimilarity matrix
(e.g. as returned by the stats::dist() function).
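To make the vector layout concrete, the sketch below shows how the lower triangle returned by stats::dist() relates to the full symmetric matrix. How sim_eval handles the vector internally is not described here; the toy data and the names d_vec and d_mat are hypothetical and only illustrate the expected ordering.

```r
# Hypothetical toy data: 4 observations, one variable
x <- matrix(c(0, 0.1, 1, 1.2), ncol = 1)
m <- nrow(x)

# dist() stores the lower triangle column-wise, without the diagonal
d_dist <- dist(x)
d_vec <- as.vector(d_dist)

# Rebuild the full symmetric matrix from the lower-triangle vector
d_mat <- matrix(0, m, m)
d_mat[lower.tri(d_mat)] <- d_vec
d_mat <- d_mat + t(d_mat)
```

The vector holds m(m - 1)/2 values, so a dataset with 4 observations yields 6 dissimilarity scores.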
Value
sim_eval returns a list with the following components:
eval: either the RMSD (and the correlation coefficient) or the kappa index.
first_nn: a matrix containing the original side informative variable in the first half of the columns, and the side informative values of the corresponding nearest neighbors in the second half of the columns.
Author(s)
References
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J. A. M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex datasets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Examples
library(resemble)
library(prospectr)
data(NIRsoil)
sg <- savitzkyGolay(NIRsoil$spc, p = 3, w = 11, m = 0)
# Replace the original spectra with the filtered ones
NIRsoil$spc <- sg
Yr <- NIRsoil$Nt[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]
# Example 1
# Compute a principal components distance
pca_d <- ortho_diss(Xr, pc_selection = list("manual", 8))$dissimilarity
# Example 1.1
# Evaluate the distance matrix on the basis of the
# side information (Yr) associated with Xr
se <- sim_eval(pca_d, side_info = as.matrix(Yr))
# The final evaluation results
se$eval
# The final values of the side information (Yr) and the values of
# the side information corresponding to the first nearest neighbors
# found by using the distance matrix
se$first_nn
# Example 1.2
# Evaluate the distance matrix on the basis of two side
# information variables (Yr and Yr_2) associated with Xr
Yr_2 <- NIRsoil$CEC[as.logical(NIRsoil$train)]
se_2 <- sim_eval(d = pca_d, side_info = cbind(Yr, Yr_2))
# The final evaluation results
se_2$eval
# The final values of the side information variables and the values
# of the side information variables corresponding to the first
# nearest neighbors found by using the distance matrix
se_2$first_nn
# Example 2
# Evaluate the distances produced by retaining different number of
# principal components (this is the same principle used in the
# optimized principal components approach ("opc"))
# first project the data
pca_2 <- ortho_projection(Xr, pc_selection = list("manual", 30))
results <- matrix(NA, pca_2$n_components, 3)
colnames(results) <- c("pcs", "rmsd", "r")
results[, 1] <- 1:pca_2$n_components
for (i in seq_len(pca_2$n_components)) {
  # Dissimilarity computed from the first i principal component scores
  ith_d <- f_diss(pca_2$scores[, 1:i, drop = FALSE], scale = TRUE)
  ith_eval <- sim_eval(ith_d, side_info = as.matrix(Yr))
  results[i, 2:3] <- as.vector(ith_eval$eval)
}
plot(results)
# Example 3
# Example 3.1
# Evaluate a dissimilarity matrix computed using the correlation
# method
cd <- cor_diss(Xr)
eval_corr_diss <- sim_eval(cd, side_info = as.matrix(Yr))
eval_corr_diss$eval