mbl {resemble} | R Documentation |
A function for memory-based learning (mbl)
Description
This function is implemented for memory-based learning (a.k.a. instance-based learning or local regression) which is a non-linear lazy learning approach for predicting a given response variable from a set of predictor variables. For each observation in a prediction set, a specific local regression is carried out based on a subset of similar observations (nearest neighbors) selected from a reference set. The local model is then used to predict the response value of the target (prediction) observation. Therefore this function does not yield a global regression model.
Usage
mbl(Xr, Yr, Xu, Yu = NULL, k, k_diss, k_range, spike = NULL,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = min(dim(Xr), 15)),
diss_method = "pca", diss_usage = "predictors", gh = TRUE,
pc_selection = list(method = "opc", value = min(dim(Xr), 40)),
control = mbl_control(), group = NULL, center = TRUE, scale = FALSE,
verbose = TRUE, documentation = character(), seed = NULL, ...)
Arguments
Xr |
a matrix of predictor variables of the reference data (observations in rows and variables in columns). |
Yr |
a numeric matrix of one column containing the values of the response variable corresponding to the reference data. |
Xu |
a matrix of predictor variables of the data to be predicted (observations in rows and variables in columns). |
Yu |
an optional matrix of one column containing the values of the
response variable corresponding to the data to be predicted. Default is
|
k |
a vector of integers specifying the sequence of k-nearest
neighbors to be tested. Either |
k_diss |
a numeric vector specifying the sequence of dissimilarity
thresholds to be tested for the selection of the nearest neighbors found in
|
k_range |
an integer vector of length 2 which specifies the minimum
(first value) and the maximum (second value) number of neighbors to be
retained when the |
spike |
an integer vector (with positive and/or negative values) indicating
the indices of observations in |
method |
an object of class |
diss_method |
a character string indicating the spectral dissimilarity metric to be used in the selection of the nearest neighbors of each observation. Options are:
Alternatively, a matrix of dissimilarities can also be passed to this
argument. This matrix is supposed to be a user-defined matrix
representing the dissimilarities between observations in |
diss_usage |
a character string specifying how the dissimilarity
information shall be used. The possible options are: |
gh |
a logical indicating if the global Mahalanobis distance (in the pls
score space) between each observation and the pls mean (centre) must be
computed. This metric is known as the GH distance in the literature. Note
that this computation is based on the number of pls components determined by
using the |
pc_selection |
a list of length 2 used for the computation of GH (if
The list
|
control |
a list created with the |
group |
an optional factor (or character vector vector
that can be coerced to |
center |
a logical if the predictor variables must be centred at each
local segment (before regression). In addition, if |
scale |
a logical indicating if the predictor variables must be scaled
to unit variance at each local segment (before regression). In addition, if
|
verbose |
a logical indicating whether or not to print a progress bar
for each observation to be predicted. Default is |
documentation |
an optional character string that can be used to
describe anything related to the |
seed |
an integer value containing the random number generator (RNG)
state for random number generation. This argument can be used for
reproducibility purposes (for random sampling) in the cross-validation
results. Default is |
... |
further arguments to be passed to the |
Details
The argument spike
can be used to indicate what reference observations
in Xr
must be kept in the neighborhood of every single Xu
observation. If a vector of length \(m\) is passed to this argument,
this means that the \(m\) original neighbors with the largest
dissimilarities to the target observations will be forced out of the
neighborhood. Spiking might be useful in cases where
some reference observations are known to be somehow related to the ones in
Xu
and therefore might be relevant for fitting the local models. See
Guerrero et al. (2010) for an example on the benefits of spiking.
The mbl
function uses the dissimilarity
function to
compute the dissimilarities between Xr
and Xu
. The dissimilarity
method to be used is specified in the diss_method
argument.
Arguments to dissimilarity
as well as further arguments to the
functions used inside dissimilarity
(i.e. ortho_diss
cor_diss
f_diss
sid
) can be passed to those functions by using ...
.
The diss_usage
argument is used to specify whether the dissimilarity
information must be used within the local regressions and, if so, how.
When diss_usage = "predictors"
the local (square symmetric)
dissimilarity matrix corresponding the selected neighborhood is used as
source of additional predictors (i.e the columns of this local matrix are
treated as predictor variables). In some cases this results in an improvement
of the prediction performance (Ramirez-Lopez et al., 2013a).
If diss_usage = "weights"
, the neighbors of the query point
(\(xu_{j}\)) are weighted according to their dissimilarity to
\(xu_{j}\) before carrying out each local regression. The following
tricubic function (Cleveland and Delvin, 1988; Naes et al., 1990) is used for
computing the final weights based on the measured dissimilarities:
where if \({xr_{i} \in }\) neighbors of \(xu_{j}\):
\[v_{j}(xu_{j}) = d(xr_{i}, xu_{j})\]otherwise:
\[v_{j}(xu_{j}) = 0\]In the above formulas \(d(xr_{i}, xu_{j})\) represents the
dissimilarity between the query point and each object in \(Xr\).
When diss_usage = "none"
is chosen the dissimilarity information is
not used.
The global Mahalanobis distance (a.k.a GH) is computed based on the scores
of a pls projection. A pls projection model is built with for {Yr}, {Xr}
and this model is used to obtain the pls scores of the Xu
observations. The Mahalanobis distance between each Xu
observation in
(the pls space) and the centre of Xr
is then computed. The number of
pls components is optimized based on the parameters passed to the
pc_selection
argument. In addition, the mbl
function also
reports the GH distance for the observations in Xr
.
Some aspects of the mbl process, such as the type of internal validation,
parameter tuning, what extra objects to return, permission for parallel
execution, prediction limits, etc, can be specified by using the
mbl_control
function.
By using the group
argument one can specify groups of observations
that have something in common (e.g. observations with very similar origin).
The purpose of group
is to avoid biased cross-validation results due
to pseudo-replication. This argument allows to select calibration points
that are independent from the validation ones. In this regard, when
validation_type = "local_cv"
(used in mbl_control
function), then the p
argument refers to the percentage of groups of
observations (rather than single observations) to be retained in each
sampling iteration at each local segment.
Value
a list
of class mbl
with the following components
(sorted either by k
or k_diss
):
call
: the call to mbl.cntrl_param
: the list with the control parameters passed to control.Xu_neighbors
: a list containing two elements: a matrix ofXr
indices corresponding to the neighbors ofXu
and a matrix of dissimilarities between eachXu
observation and its corresponding neighbor inXr
.dissimilarities
: a list with the method used to obtain the dissimilarity matrices and the dissimilarity matrix corresponding to \(D(Xr, Xu)\). This object is returned only if thereturn_dissimilarity
argument in thecontrol
list was set toTRUE
.n_predictions
: the total number of observations predicted.gh
: ifgh = TRUE
, a list containing the global Mahalanobis distance values for the observations inXr
andXu
as well as the results of the global pls projection object used to obtain the GH values.validation_results
: a list of validation results for "local cross validation" (returned if thevalidation_type
incontrol
list was set to"local_cv"
), "nearest neighbor validation" (returned if thevalidation_type
incontrol
list was set to"NNv"
) and "Yu prediction statistics" (returned ifYu
was supplied).“results
: a list of data tables containing the results of the predictions for each eitherk
ork_diss
. Each data table contains the following columns:o_index
: The index of the predicted observation.k_diss
: This column is only output if thek_diss
argument is used. It indicates the corresponding dissimilarity threshold for selecting the neighbors.k_original
: This column is only output if thek_diss
argument is used. It indicates the number of neighbors that were originally found when the given dissimilarity threshold is used.k
: This column indicates the final number of neighbors used.npls
: This column is only output if thepls
regression method was used. It indicates the final number of pls components used.min_pls
: This column is only output ifwapls
regression method was used. It indicates the final number of minimum pls components used. If no optimization was set, it retrieves the original minimum pls components passed to themethod
argument.max_pls
: This column is only output if thewapls
regression method was used. It indicates the final number of maximum pls components used. If no optimization was set, it retrieves the original maximum pls components passed to themethod
argument.yu_obs
: The input values given inYu
(the response variable corresponding to the data to be predicted). IfYu = NULL
, thenNA
s are retrieved.pred
: The predicted Yu values.yr_min_obs
: The minimum reference value (of the response variable) in the neighborhood.yr_max_obs
: The maximum reference value (of the response variable) in the neighborhood.index_nearest_in_Xr
: The index of the nearest neighbor found inXr
.index_farthest_in_Xr
: The index of the farthest neighbor found inXr
.y_nearest
: The reference value (Yr
) corresponding to the nearest neighbor found inXr
.y_nearest_pred
: This column is only output if the validation method in the object passed tocontrol
was set to"NNv"
. It represents the predicted value of the nearest neighbor observation found inXr
. This prediction come from model fitted with the remaining observations in the neighborhood of the target observation inXu
.loc_rmse_cv
: This column is only output if the validation method in the object passed tocontrol
was set to'local_cv'
. It represents the RMSE of the cross-validation computed for the neighborhood of the target observation inXu
.loc_st_rmse_cv
: This column is only output if the validation method in the object passed tocontrol
was set to'local_cv'
. It represents the standardized RMSE of the cross-validation computed for the neighborhood of the target observation inXu
.dist_nearest
: The distance to the nearest neighbor.dist_farthest
: The distance to the farthest neighbor.loc_n_components
: This column is only output if the dissimilarity method used is one of"pca"
,"pca.nipals"
or"pls"
and in addition the dissimilarities are requested to be computed locally by passing.local = TRUE
to thembl
function. See.local
argument in theortho_diss
function.
seed
: a value mirroring the one passed to seed.documentation
: a character string mirroring the one provided in thedocumentation
argument.
When the k_diss
argument is used, the printed results show a table
with a column named 'p_bounded
. It represents the percentage of
observations for which the neighbors selected by the given dissimilarity
threshold were outside the boundaries specified in the k_range
argument.
Author(s)
Leonardo Ramirez-Lopez and Antoine Stevens
References
Cleveland, W. S., and Devlin, S. J. 1988. Locally weighted regression: an approach to regression analysis by local fitting. Journal of the American Statistical Association, 83, 596-610.
Guerrero, C., Zornoza, R., Gómez, I., Mataix-Beneyto, J. 2010. Spiking of NIR regional models using observations from target sites: Effect of model size on prediction accuracy. Geoderma, 158(1-2), 66-77.
Naes, T., Isaksson, T., Kowalski, B. 1990. Locally weighted regression and scatter correction for near-infrared reflectance data. Analytical Chemistry 62, 664-673.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Stevens, A., Dematte, J.A.M., Scholten, T. 2013a. The spectrum-based learner: A new local approach for modeling soil vis-NIR spectra of complex data sets. Geoderma 195-196, 268-279.
Ramirez-Lopez, L., Behrens, T., Schmidt, K., Viscarra Rossel, R., Dematte, J. A. M., Scholten, T. 2013b. Distance and similarity-search metrics for use with soil vis-NIR spectra. Geoderma 199, 43-53.
Rasmussen, C.E., Williams, C.K. Gaussian Processes for Machine Learning. Massachusetts Institute of Technology: MIT-Press, 2006.
Shenk, J., Westerhaus, M., and Berzaghi, P. 1997. Investigation of a LOCAL calibration procedure for near infrared instruments. Journal of Near Infrared Spectroscopy, 5, 223-232.
See Also
mbl_control
, f_diss
,
cor_diss
, sid
, ortho_diss
,
search_neighbors
, local_fit
Examples
library(prospectr)
data(NIRsoil)
# Proprocess the data using detrend plus first derivative with Savitzky and
# Golay smoothing filter
sg_det <- savitzkyGolay(
detrend(NIRsoil$spc,
wav = as.numeric(colnames(NIRsoil$spc))
),
m = 1,
p = 1,
w = 7
)
NIRsoil$spc_pr <- sg_det
# split into training and testing sets
test_x <- NIRsoil$spc_pr[NIRsoil$train == 0 & !is.na(NIRsoil$CEC), ]
test_y <- NIRsoil$CEC[NIRsoil$train == 0 & !is.na(NIRsoil$CEC)]
train_y <- NIRsoil$CEC[NIRsoil$train == 1 & !is.na(NIRsoil$CEC)]
train_x <- NIRsoil$spc_pr[NIRsoil$train == 1 & !is.na(NIRsoil$CEC), ]
# Example 1
# A mbl implemented in Ramirez-Lopez et al. (2013,
# the spectrum-based learner)
# Example 1.1
# An exmaple where Yu is supposed to be unknown, but the Xu
# (spectral variables) are known
my_control <- mbl_control(validation_type = "NNv")
## The neighborhood sizes to test
ks <- seq(40, 140, by = 20)
sbl <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
k = ks,
method = local_fit_gpr(),
control = my_control,
scale = TRUE
)
sbl
plot(sbl)
get_predictions(sbl)
# Example 1.2
# If Yu is actually known...
sbl_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_gpr(),
control = my_control
)
sbl_2
plot(sbl_2)
# Example 2
# the LOCAL algorithm (Shenk et al., 1997)
local_algorithm <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm
plot(local_algorithm)
# Example 3
# A variation of the LOCAL algorithm (using the optimized pc
# dissmilarity matrix) and dissimilarity matrix as source of
# additional preditors
local_algorithm_2 <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pca",
diss_usage = "predictors",
control = my_control
)
local_algorithm_2
plot(local_algorithm_2)
# Example 4
# Running the mbl function in parallel with example 2
n_cores <- 2
if (parallel::detectCores() < 2) {
n_cores <- 1
}
# Alternatively:
# n_cores <- parallel::detectCores() - 1
# if (n_cores == 0) {
# n_cores <- 1
# }
library(doParallel)
clust <- makeCluster(n_cores)
registerDoParallel(clust)
# Alernatively:
# library(doSNOW)
# clust <- makeCluster(n_cores, type = "SOCK")
# registerDoSNOW(clust)
# getDoParWorkers()
local_algorithm_par <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "cor",
diss_usage = "none",
control = my_control
)
local_algorithm_par
registerDoSEQ()
try(stopCluster(clust))
# Example 5
# Using local pls distances
with_local_diss <- mbl(
Xr = train_x,
Yr = train_y,
Xu = test_x,
Yu = test_y,
k = ks,
method = local_fit_wapls(min_pls_c = 3, max_pls_c = 15),
diss_method = "pls",
diss_usage = "predictors",
control = my_control,
.local = TRUE,
pre_k = 150,
)
with_local_diss
plot(with_local_diss)