R: Local distance-based linear model

ldblm {dbstats}

R Documentation

Local distance-based linear model

Description

ldblm is a localized version of a distance-based linear model. As in the global model dblm, explanatory information is coded as distances between individuals.

The neighborhood used for localizing is defined by the (semi)metric dist1, whereas a second (semi)metric dist2 (which may coincide with dist1) is used for distance-based prediction. Both dist1 and dist2 can either be computed from observed explanatory variables or be input directly as a squared distances matrix or as a Gram matrix. The response is a continuous variable, as in the ordinary linear model. The model allows for a mixture of continuous and qualitative explanatory variables or, in fact, for more general quantities such as functional data.

Notation convention: in distance-based methods we must distinguish observed explanatory variables, which we denote by Z or z, from Euclidean coordinates, which we denote by X or x. For an explanation of both terms, see the references below.

Usage


## S3 method for class 'formula'
ldblm(formula,data,...,kind.of.kernel=1,
        metric1="euclidean",metric2=metric1,method.h="GCV",weights,
        user.h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,eff.rank=NULL)

## S3 method for class 'dist'
ldblm(dist1,dist2=dist1,y,kind.of.kernel=1,
        method.h="GCV",weights,user.h=quantile(dist1,.25),
        h.range=quantile(as.matrix(dist1),c(.05,.5)),noh=10,
        k.knn=3,rel.gvar=0.95,eff.rank=NULL,...)  

## S3 method for class 'D2'
ldblm(D2.1,D2.2=D2.1,y,kind.of.kernel=1,method.h="GCV",
        weights,user.h=quantile(D2.1,.25)^.5,
        h.range=quantile(as.matrix(D2.1),c(.05,.5))^.5,noh=10,k.knn=3,
        rel.gvar=0.95,eff.rank=NULL,...) 
         
## S3 method for class 'Gram'
ldblm(G1,G2=G1,y,kind.of.kernel=1,method.h="GCV",
        weights,user.h=NULL,h.range=NULL,noh=10,k.knn=3,rel.gvar=0.95,
        eff.rank=NULL,...)

Arguments

`formula`	an object of class `formula`. A formula of the form `y~Z`. This argument is a remnant of the `loess` function, kept for compatibility.
`data`	an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).
`y`	(required if no formula is given as the principal argument). Response (dependent variable) must be numeric, matrix or data.frame.
`dist1`	a `dist` or `dissimilarity` class object. Distances between observations, used to define the local neighborhoods. Weights for observations are computed as a decreasing function of their `dist1` distances to the neighborhood center, e.g. a new observation whose response has to be predicted. These weights are then passed to a `dblm`, where distances are evaluated with `dist2`.
`dist2`	a `dist` or `dissimilarity` class object. Distances between observations, used for fitting `dblm`. Default `dist2=dist1`.
`D2.1`	a `D2` class object. Squared distances matrix between individuals. One of the alternative ways of entering distance information to a function. See the Details section in `dblm`. See above `dist1` for explanation of its role in this function.
`D2.2`	a `D2` class object. Squared distances between observations. One of the alternative ways of entering distance information to a function. See the Details section in `dblm`. See above `dist2` for explanation of its role in this function. Default `D2.2=D2.1`.
`G1`	a `Gram` class object. Doubly centered inner product matrix associated with the squared distances matrix `D2.1`.
`G2`	a `Gram` class object. Doubly centered inner product matrix associated with the squared distances matrix `D2.2`. Default `G2=G1`.
`kind.of.kernel`	integer number between 1 and 6 which determines the user's choice of smoothing kernel. (1) Epanechnikov (Default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.
`metric1`	metric function to be used when computing `dist1` from observed explanatory variables. One of `"euclidean"` (default), `"manhattan"`, or `"gower"`.
`metric2`	metric function to be used when computing `dist2` from observed explanatory variables. One of `"euclidean"`, `"manhattan"`, or `"gower"`. Default `metric2=metric1`.
`method.h`	sets the method to be used in deciding the optimal bandwidth h. There are five different methods, `AIC`, `BIC`, `OCV`, `GCV` (default) and `user.h`. `OCV` and `GCV` take the optimal bandwidth minimizing a cross-validatory quantity (either `ocv` or `gcv`). `AIC` and `BIC` take the optimal bandwidth minimizing, respectively, the Akaike or Bayesian Information Criterion (see `AIC` for more details). When `method.h` is `user.h`, the bandwidth is explicitly set by the user through the `user.h` optional parameter which, in this case, becomes mandatory.
`weights`	an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.
`user.h`	global bandwidth `user.h`, set by the user, controlling the size of the local neighborhood of Z. Smoothing parameter (Default: 1st quartile of all the distances d(i,j) in `dist1`). Applies only if `method.h="user.h"`.
`h.range`	a vector of length 2 giving the range for automatic bandwidth choice. (Default: quantiles 0.05 and 0.5 of d(i,j) in `dist1`).
`noh`	number of bandwidth `h` values within `h.range` for automatic bandwidth choice (if `method.h!="user.h"`).
`k.knn`	minimum number of observations with positive weight in each local neighborhood. This avoids runtime errors caused by a bandwidth so small that a neighborhood contains only one observation. By default `k.knn=3`.
`rel.gvar`	relative geometric variability (a real number between 0 and 1). In each `dblm` iteration, take the lowest effective rank with a relative geometric variability greater than or equal to `rel.gvar`. The default value (`rel.gvar=0.95`) uses 95% of the total variability.
`eff.rank`	integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each `dblm` iteration. If specified, its value overrides `rel.gvar`. When `eff.rank=NULL` (default), calls to `dblm` are made with `method="rel.gvar"`.
`...`	arguments passed to or from other methods to the low level.
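
The relation between a squared distances matrix, its doubly centered Gram matrix and the `rel.gvar` rule can be illustrated with base R alone. The following is a minimal sketch, assuming equal weights for all observations and a simple numerical tolerance for discarding null eigenvalues; the variable names are illustrative and the internal computations of `dblm` may differ in detail.

```r
# Sketch: lowest effective rank reaching a given relative geometric variability.
# Assumptions (not taken from the package source): equal observation weights,
# and eigenvalues below 1e-8 * max are treated as zero.
set.seed(1)
Z  <- matrix(rnorm(60), ncol = 3)          # 20 observations, 3 variables
D2 <- as.matrix(dist(Z))^2                 # squared Euclidean distances
n  <- nrow(D2)

J <- diag(n) - matrix(1 / n, n, n)         # centering matrix
G <- -0.5 * J %*% D2 %*% J                 # doubly centered inner products

ev <- eigen(G, symmetric = TRUE, only.values = TRUE)$values
ev <- ev[ev > 1e-8 * max(ev)]              # keep the positive eigenvalues

rel      <- cumsum(ev) / sum(ev)           # cumulative share of variability
rel.gvar <- 0.95
eff.rank <- which(rel >= rel.gvar)[1]      # smallest rank reaching rel.gvar
eff.rank
```

Supplying `eff.rank` directly skips this selection, which is why it overrides `rel.gvar`.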

Details

Two semi-metrics are involved in local linear distance-based estimation: dist1 and dist2. They may coincide, but need not. For instance, when dist1=||xi-xj|| and dist2=||(xi,xi^2,xi^3)-(xj,xj^2,xj^3)||, the estimator for new observations coincides with fitting a local cubic polynomial regression.
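
As an illustration of the cubic case, the two distance objects can be built with base R's `dist`. This is only a sketch with simulated data; the commented call at the end follows the `dist` method signature listed under Usage and is not executed here.

```r
# Sketch: dist1 localizes on x, dist2 carries the cubic polynomial features.
set.seed(1)
x <- rnorm(20)
y <- x + x^2 + x^3 + rnorm(20)             # simulated response

dist1 <- dist(x)                           # ||xi - xj||
dist2 <- dist(cbind(x, x^2, x^3))          # ||(xi,xi^2,xi^3) - (xj,xj^2,xj^3)||

# With dbstats attached, the fit would be requested along these lines:
# fit <- ldblm(dist1, dist2, y)
```

Since `dist2` adds coordinates to those used by `dist1`, every `dist2` distance is at least as large as the corresponding `dist1` distance.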

The set of bandwidth h values checked in automatic bandwidth choice is defined by h.range and noh, together with k.knn. For each h in it a local linear model is fitted and the optimal h is decided according to the statistic specified in method.h.
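
A candidate set of bandwidths can be sketched as follows, using the default `h.range` shown under Usage for the `dist` method. The equal spacing of the grid is an assumption made for illustration (the spacing used internally is not stated here), and the last lines only count neighbors; they do not reproduce how the function enforces `k.knn`.

```r
# Sketch: candidate bandwidths from h.range and noh (equal spacing assumed).
set.seed(1)
Z     <- matrix(rnorm(50), ncol = 1)
dist1 <- dist(Z)
D     <- as.matrix(dist1)

h.range <- unname(quantile(D, c(.05, .5))) # default range from Usage
noh     <- 10
h.grid  <- seq(h.range[1], h.range[2], length.out = noh)

# Observations within the smallest bandwidth of each point (itself included);
# neighborhoods with fewer than k.knn points are the ones k.knn protects.
k.knn <- 3
n.in  <- rowSums(D < h.grid[1])
any(n.in < k.knn)
```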

kind.of.kernel designates which kernel function is to be used in determining individual weights from dist1 values. See density for more information.
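
For reference, the six kernels can be written in their usual textbook form as functions of u = d/h, where d is a `dist1` distance and h the bandwidth. The helper `kernel_weight` below is hypothetical and is not part of dbstats; the normalizing constants and the scaling of the normal kernel used by the package are assumptions here.

```r
# Sketch: textbook kernels indexed as in kind.of.kernel (hypothetical helper).
kernel_weight <- function(u, kind.of.kernel = 1) {
  stopifnot(kind.of.kernel %in% 1:6)
  inside <- abs(u) <= 1                    # compact support for all but Normal
  switch(kind.of.kernel,
    0.75 * (1 - u^2) * inside,             # 1: Epanechnikov
    15 / 16 * (1 - u^2)^2 * inside,        # 2: Biweight
    35 / 32 * (1 - u^2)^3 * inside,        # 3: Triweight
    dnorm(u),                              # 4: Normal
    (1 - abs(u)) * inside,                 # 5: Triangular
    0.5 * inside                           # 6: Uniform
  )
}

d <- c(0, 0.2, 0.5, 1.5)                   # distances to the neighborhood center
h <- 1                                     # bandwidth
kernel_weight(d / h, kind.of.kernel = 1)
```

In a weighted fit only the relative sizes of the weights matter, so the choice of normalizing constant does not change the fitted values.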

Value

A list of class ldblm containing the following components:

`residuals`	the residuals (response minus fitted values).
`fitted.values`	the fitted mean values.
`h.opt`	the optimal bandwidth h used in the fitting process (if `method.h!="user.h"`).
`S`	the smoother hat matrix (projector).
`weights`	the specified weights.
`y`	the response variable used.
`call`	the matched call.
`dist1`	the distance matrix (object of class `"D2"` or `"dist"`) used to calculate the weights of the observations.
`dist2`	the distance matrix (object of class `"D2"` or `"dist"`) used to fit the `dblm`.

Note

Model fitting is repeated n times (n = number of observations) for each bandwidth, that is, noh*n fits in total. When noh is large or the sample has many observations, the running time of this function can be very long.

Author(s)

Boj, Eva <evaboj@ub.edu>, Caballe, Adria <adria.caballe@upc.edu>, Delicado, Pedro <pedro.delicado@upc.edu> and Fortiana, Josep <fortiana@ub.edu>

References

Boj E, Caballe A, Delicado P, Esteve A, Fortiana J (2016). Global and local distance-based generalized linear models. TEST 25, 170-195.

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

Examples


# example of use of the ldblm function
n <- 100
p <- 1
k <- 5

Z <- matrix(rnorm(n*p),nrow=n)
b1 <- matrix(runif(p)*k,nrow=p)
b2 <- matrix(runif(p)*k,nrow=p)
b3 <- matrix(runif(p)*k,nrow=p)

s <- 1
e <- rnorm(n)*s


y <- Z%*%b1 + Z^2%*%b2 +Z^3%*%b3 + e

D2 <- as.matrix(dist(Z)^2)
class(D2) <- "D2"

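# formula interface: bandwidth chosen by GCV among noh=3 candidate values
ldblm1 <- ldblm(y~Z,kind.of.kernel=1,method.h="GCV",noh=3,k.knn=3)
# D2 interface: bandwidth fixed at the default user.h
ldblm2 <- ldblm(D2.1=D2,D2.2=D2,y,kind.of.kernel=1,method.h="user.h",k.knn=3)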

[Package dbstats version 2.0.2 Index]