search_neighbors {resemble}

R Documentation

A function for searching in a given reference set the neighbors of another given set of observations (search_neighbors)

Description

This function searches a reference set for the neighbors of the observations provided in another set.

Usage

search_neighbors(Xr, Xu, diss_method = c("pca", "pca.nipals", "pls", "mpls",
                                         "cor", "euclid", "cosine", "sid"),
                 Yr = NULL, k, k_diss, k_range, spike = NULL,
                 pc_selection = list("var", 0.01),
                 return_projection = FALSE, return_dissimilarity = FALSE,
                 ws = NULL,
                 center = TRUE, scale = FALSE,
                 documentation = character(), ...)

Arguments

`Xr`	a matrix of reference (spectral) observations where the neighbor search is to be conducted. See details.
`Xu`	an optional matrix of (spectral) observations whose neighbors are to be searched in `Xr`. Default is `NULL`. See details.
`diss_method`	a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. The options are:
`"pca"`: Mahalanobis distance computed on the matrix of scores of a principal component (PC) projection of `Xr` (and `Xu` if supplied). The PC projection is done using the singular value decomposition (SVD) algorithm. See the `ortho_diss` function.
`"pca.nipals"`: Mahalanobis distance computed on the matrix of scores of a principal component (PC) projection of `Xr` (and `Xu` if supplied). The PC projection is done using the non-linear iterative partial least squares (NIPALS) algorithm. See the `ortho_diss` function.
`"pls"`: Mahalanobis distance computed on the matrix of scores of a partial least squares projection of `Xr` (and `Xu` if supplied). In this case, `Yr` is always required. See the `ortho_diss` function.
`"mpls"`: Mahalanobis distance computed on the matrix of scores of a modified partial least squares projection (Shenk and Westerhaus, 1991; Westerhaus, 2014) of `Xr` (and `Xu` if supplied). In this case, `Yr` is always required. See the `ortho_diss` function.
`"cor"`: correlation dissimilarity between observations. See the `cor_diss` function.
`"euclid"`: Euclidean distance between observations. See the `f_diss` function.
`"cosine"`: cosine distance between observations. See the `f_diss` function.
`"sid"`: spectral information divergence between observations. See the `sid` function.
`Yr`	a numeric matrix of `n` observations used as side information of `Xr` for the `ortho_diss` methods (i.e. `"pca"`, `"pca.nipals"` or `"pls"`). It is required when `diss_method = "pls"`, or when `diss_method = "pca"` and `"opc"` is used as the method in the `pc_selection` argument. See `ortho_diss()`.
`k`	an integer value indicating the number of nearest neighbors of each observation in `Xu` that must be selected from `Xr`.
`k_diss`	a numeric value indicating a dissimilarity threshold. For each observation in `Xu`, its nearest neighbors in `Xr` are selected as those whose dissimilarity to that observation is below this `k_diss` threshold. A suitable threshold depends on the dissimilarity metric specified in `diss_method`. Either `k` or `k_diss` must be specified.
`k_range`	an integer vector of length 2 which specifies the minimum (first value) and the maximum (second value) number of neighbors to be retained when `k_diss` is given.
`spike`	a vector of integers (with positive and/or negative values) indicating which observations in `Xr` (and `Yr`) must be forced into (positive values) or excluded from (negative values) the neighborhoods.
`pc_selection`	a list of length 2 to be passed on to the `ortho_diss` methods. It is required if the method selected in `diss_method` is any of `"pca"`, `"pca.nipals"` or `"pls"`. This argument is used for optimizing the number of components (principal components or pls factors) to be retained. The list must contain two elements in the following order: `method` (a character string indicating the method for selecting the number of components) and `value` (a numerical value that complements the selected method). The methods available are:
`"opc"`: optimized principal component selection based on Ramirez-Lopez et al. (2013a, 2013b). The optimal number of components for a given set of observations is the one whose distance matrix minimizes the differences between the `Yr` value of each observation and the `Yr` value of its closest observation. In this case `value` must be a value (larger than 0 and below the minimum dimension of `Xr` or `Xr` and `Xu` combined) indicating the maximum number of principal components to be tested. See the `ortho_projection` function for more details.
`"cumvar"`: selection of the principal components based on a given cumulative amount of explained variance. In this case, `value` must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of cumulative variance that the combination of retained components should explain.
`"var"`: selection of the principal components based on a given amount of explained variance. In this case, `value` must be a value (larger than 0 and below or equal to 1) indicating the minimum amount of variance that a single component should explain in order to be retained.
`"manual"`: for manually specifying a fixed number of principal components. In this case, `value` must be a value (larger than 0 and below the minimum dimension of `Xr` or `Xr` and `Xu` combined) indicating the number of components to be retained.
The default is `list(method = "var", value = 0.01)`. Optionally, the `pc_selection` argument admits `"opc"`, `"cumvar"`, `"var"` or `"manual"` as a single character string. In that case, the default `value` is 40 when either `"opc"` or `"manual"` is used, 0.99 when `"cumvar"` is used and 0.01 when `"var"` is used.
`return_projection`	a logical indicating if the projection(s) must be returned. Projections are used if the `ortho_diss` methods are called (i.e. `diss_method = "pca"`, `diss_method = "pca.nipals"` or `diss_method = "pls"`).
`return_dissimilarity`	a logical indicating if the dissimilarity matrix used for neighbor search must be returned.
`ws`	an odd integer value which specifies the window size for the moving correlation dissimilarity when `diss_method = "cor"` (`cor_diss` method). If `ws = NULL` (default), the window size is equal to the number of variables (columns), i.e. the normal correlation is used instead of the moving correlation. See the `cor_diss` function.
`center`	a logical indicating if the `Xr` and `Xu` matrices must be centered. If `Xu` is provided, the data is centered around the mean of the pooled `Xr` and `Xu` matrices (\(Xr \cup Xu\)). For dissimilarity computations based on `diss_method = "pls"`, the data is always centered.
`scale`	a logical indicating if the `Xr` and `Xu` matrices must be scaled. If `Xu` is provided, the data is scaled based on the standard deviation of the pooled `Xr` and `Xu` matrices (\(Xr \cup Xu\)). If `center = TRUE`, scaling is applied after centering.
`documentation`	an optional character string that can be used to describe anything related to the function call (e.g. description of the input data). Default: `character()`. NOTE: this is an experimental argument.
`...`	further arguments to be passed to the `dissimilarity` function. See details.
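
The `"var"` and `"cumvar"` rules described for `pc_selection` can be illustrated with a minimal base-R sketch. This is not the code used by `ortho_diss`; the simulated matrix `X` and all variable names are assumptions made only for this illustration.

```r
# Illustration only: simulated data, not the internals of resemble
set.seed(1)
X <- matrix(rnorm(50 * 8), nrow = 50)          # hypothetical 50 x 8 matrix
Xc <- scale(X, center = TRUE, scale = FALSE)   # column-center the data

# Proportion of variance explained by each principal component (via SVD)
sv <- svd(Xc)$d
explained <- sv^2 / sum(sv^2)

# "var" rule: retain components that individually explain at least `value`
value_var <- 0.01
n_var <- sum(explained >= value_var)

# "cumvar" rule: retain the fewest components whose cumulative explained
# variance reaches `value`
value_cumvar <- 0.99
n_cumvar <- which(cumsum(explained) >= value_cumvar)[1]

c(var = n_var, cumvar = n_cumvar)
```

With real spectra, which are usually highly collinear, both rules tend to retain far fewer components than there are variables.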

Details

This function may be especially useful when the reference set (Xr) is very large. In some cases the number of observations in the reference set can be reduced by removing irrelevant observations (i.e. observations that are not neighbors of a particular target set). For example, this function can be used to reduce the size of the reference set before running the mbl function.
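
As a rough sketch of that reduction step, the snippet below subsets a reference set using a vector of neighbor indices. The object `nn` is a hand-made stand-in for the list returned by `search_neighbors` (only its `unique_neighbors` element is mimicked), and the data are simulated, so all values here are assumptions.

```r
# Hypothetical inputs: 10 reference observations with 3 variables each
set.seed(7)
Xr <- matrix(rnorm(10 * 3), nrow = 10)
Yr <- rnorm(10)

# Stand-in for the list returned by search_neighbors() (one element only)
nn <- list(unique_neighbors = c(4L, 2L, 7L, 9L))

# Keep only the reference observations that appear in at least one neighborhood
Xr_reduced <- Xr[nn$unique_neighbors, , drop = FALSE]
Yr_reduced <- Yr[nn$unique_neighbors]

# Reference observations that were never selected as neighbors
not_selected <- setdiff(seq_len(nrow(Xr)), nn$unique_neighbors)

dim(Xr_reduced)  # 4 rows, 3 columns
not_selected     # 1 3 5 6 8 10
```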

This function uses the dissimilarity function to compute the dissimilarities between Xr and Xu. Arguments to dissimilarity, as well as further arguments to the functions used inside dissimilarity (i.e. ortho_diss, cor_diss, f_diss and sid), can be passed to those functions as additional arguments (i.e. ...).
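
To give an idea of what a Mahalanobis distance on PC scores means (the `"pca"` option of `diss_method`), here is a minimal base-R sketch on simulated data. It is not the code used by `ortho_diss`, and the package may apply further scaling, so the values need not match those returned by the package. The number of components (`n_comp`) is an arbitrary assumption.

```r
# Illustration only: simulated data, not the code used by ortho_diss
set.seed(42)
X <- matrix(rnorm(30 * 6), nrow = 30)
pca <- prcomp(X, center = TRUE, scale. = FALSE)

n_comp <- 3                                   # hypothetical number of components
scores <- pca$x[, seq_len(n_comp)]

# PC scores are uncorrelated, so the Mahalanobis distance in score space
# equals the Euclidean distance between scores scaled to unit variance
scaled_scores <- sweep(scores, 2, pca$sdev[seq_len(n_comp)], "/")
d_scaled <- as.matrix(dist(scaled_scores))

# Cross-check against stats::mahalanobis() for distances to observation 1
d_ref <- sqrt(mahalanobis(scores, center = scores[1, ], cov = cov(scores)))
same <- isTRUE(all.equal(unname(d_scaled[, 1]), unname(d_ref)))
same
```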

If no matrix is passed to Xu, the neighbor search is conducted for the observations in Xr, and their neighbors are searched within that same matrix. If a matrix is passed to Xu, the neighbors of Xu are searched in the Xr matrix.
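
The two search modes can be sketched with a small base-R helper based on Euclidean distance. The function `nearest_in_reference` is hypothetical (it is not part of resemble), and excluding each observation from its own neighborhood when Xu is missing is an assumption of this sketch.

```r
# Hypothetical helper (not part of resemble): indices of the k rows of Xr
# closest to each target observation, using Euclidean distance
nearest_in_reference <- function(Xr, Xu = NULL, k) {
  if (is.null(Xu)) {
    # No Xu: the targets are the rows of Xr themselves
    d <- as.matrix(dist(Xr))
    diag(d) <- Inf  # assumption: an observation is not its own neighbor
  } else {
    # Xu supplied: rows index Xr, columns index Xu
    d <- apply(Xu, 1, function(u) sqrt(colSums((t(Xr) - u)^2)))
  }
  # One column per target observation, neighbors sorted by distance
  matrix(apply(d, 2, function(dcol) order(dcol)[seq_len(k)]), nrow = k)
}

Xr <- rbind(c(0, 0), c(1, 0), c(5, 5), c(6, 5))
Xu <- rbind(c(0.9, 0.1), c(5.2, 5.1))

nn_xu <- nearest_in_reference(Xr, Xu, k = 2)  # neighbors of Xu searched in Xr
nn_xr <- nearest_in_reference(Xr, k = 2)      # neighbors of Xr searched in Xr
```

In both cases the returned indices refer to rows of Xr; only the set of target observations changes.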

Value

a list containing the following elements:

neighbors_diss: a matrix of the Xr dissimilarity scores corresponding to the neighbors of each Xr observation (or Xu observation, in case Xu was supplied). The neighbor dissimilarity scores are organized by columns and are sorted in ascending order.
neighbors: a matrix of the Xr indices corresponding to the neighbors of each observation in Xr (or in Xu, in case it was supplied). The neighbor indices are organized by columns and are sorted in ascending order by their dissimilarity score.
unique_neighbors: a vector of the indices in Xr identified as neighbors of any observation in Xr (or in Xu, in case it was supplied). This is obtained by converting the neighbors matrix into a vector and applying the unique function.
k_diss_info: a data.table that is returned only if the k_diss argument was used. It comprises three columns: the first one (Xr_index or Xu_index) indicates the index of the observations in Xr (or in Xu, in case it was supplied), the second column (n_k) indicates the number of neighbors found in Xr and the third column (final_n_k) indicates the final number of neighbors selected, bounded by the k_range argument.
dissimilarity: the dissimilarity object used (as computed by the dissimilarity function). Only returned if return_dissimilarity = TRUE.
projection: an ortho_projection object. Only output if return_projection = TRUE and if diss_method = "pca", diss_method = "pca.nipals" or diss_method = "pls".
This object contains the projection used to compute the dissimilarity matrix. In case of local dissimilarity matrices, the projection corresponds to the global projection used to select the neighborhoods (see the ortho_diss function for further details).
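
The relation between n_k and final_n_k in k_diss_info can be sketched for a single target observation as follows. The dissimilarity values, the threshold and the range are made up, and the code is an assumed reading of the description above rather than the package's implementation.

```r
# Hypothetical dissimilarities between one target observation and
# 6 reference observations
d <- c(0.80, 0.10, 0.35, 0.05, 0.60, 0.20)
k_diss <- 0.3        # dissimilarity threshold
k_range <- c(4, 5)   # minimum and maximum number of neighbors

# Neighbors found below the threshold
n_k <- sum(d < k_diss)                              # 3

# Final neighborhood size, bounded by k_range
final_n_k <- min(max(n_k, k_range[1]), k_range[2])  # 4

# Neighbor indices and dissimilarities, sorted in ascending order
neighbors <- order(d)[seq_len(final_n_k)]           # 4 2 6 3
neighbors_diss <- d[neighbors]                      # 0.05 0.10 0.20 0.35
```

Here only three observations fall below the threshold, so the lower bound of k_range adds the next closest observation to the neighborhood.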

Author(s)

Leonardo Ramirez-Lopez.

References

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.

Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.

Examples


library(resemble)
library(prospectr)

data(NIRsoil)

Xu <- NIRsoil$spc[!as.logical(NIRsoil$train), ]
Yu <- NIRsoil$CEC[!as.logical(NIRsoil$train)]
Yr <- NIRsoil$CEC[as.logical(NIRsoil$train)]
Xr <- NIRsoil$spc[as.logical(NIRsoil$train), ]

Xu <- Xu[!is.na(Yu), ]
Yu <- Yu[!is.na(Yu)]

Xr <- Xr[!is.na(Yr), ]
Yr <- Yr[!is.na(Yr)]

# Identify the neighbor observations using the correlation dissimilarity and
# default parameters
# (ex1$unique_neighbors lists the observations in Xr that are among the
# 40 nearest neighbors of at least one observation in Xu)
ex1 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "cor",
  k = 40
)

# Identify the neighbor observations using principal component (PC)
# and partial least squares (PLS) dissimilarities, and using the "opc"
# approach for selecting the number of components
ex2 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pca",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)

# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex2$unique_neighbors]

ex3 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE
)
# Observations that do not belong to any neighborhood
seq(1, nrow(Xr))[!seq(1, nrow(Xr)) %in% ex3$unique_neighbors]

# Identify the neighbor observations using local PLS dissimilarities
# Here, 150 neighbors are used to compute a local dissimilarity matrix
# and then this matrix is used to select 50 neighbors
ex4 <- search_neighbors(
  Xr = Xr, Xu = Xu,
  diss_method = "pls",
  Yr = Yr, k = 50,
  pc_selection = list("opc", 40),
  scale = TRUE,
  .local = TRUE,
  pre_k = 150
)

[Package resemble version 2.2.3 Index]