dblm {dbstats} | R Documentation |

## Distance-based linear model

### Description

`dblm` is a variety of linear model where explanatory information
is coded as distances between individuals. These distances can either
be computed from observed explanatory variables or directly input as
a squared distances matrix. The response is a continuous variable as
in the ordinary linear model. Since distances can be computed from a mixture
of continuous and qualitative explanatory variables or,
in fact, from more general quantities, `dblm` is a proper extension of `lm`.

Notation convention: in distance-based methods we must distinguish
*observed explanatory variables*, which we denote by Z or z, from
*Euclidean coordinates*, which we denote by X or x. For an explanation
of both terms, see the references below.

### Usage

```
## S3 method for class 'formula'
dblm(formula,data,...,metric="euclidean",method="OCV",full.search=TRUE,
weights,rel.gvar=0.95,eff.rank)
## S3 method for class 'dist'
dblm(distance,y,...,method="OCV",full.search=TRUE,
weights,rel.gvar=0.95,eff.rank)
## S3 method for class 'D2'
dblm(D2,y,...,method="OCV",full.search=TRUE,weights,rel.gvar=0.95,
eff.rank)
## S3 method for class 'Gram'
dblm(G,y,...,method="OCV",full.search=TRUE,weights,rel.gvar=0.95,
eff.rank)
```

### Arguments

`formula` |
an object of class `formula`. |

`data` |
an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X). |

`y` |
(required if no formula is given as the principal argument). Response (dependent variable) must be numeric, matrix or data.frame. |

`distance` |
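an object of class `dist`. |

`D2` |
an object of class `D2` (squared distances matrix). |

`G` |
an object of class `Gram` (doubly centered inner product matrix). |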

`metric` |
metric function to be used when computing distances from observed
explanatory variables.
One of `"euclidean"` (the default), `"manhattan"` or `"gower"`. |

`method` |
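sets the method to be used in deciding the effective rank, that is, the number of Euclidean coordinates used for model fitting (default `"OCV"`). When method is `"eff.rank"`, the rank is set explicitly through the `eff.rank` argument. When method is `"rel.gvar"`, it is determined by the `rel.gvar` argument. |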

`full.search` |
sets which optimization procedure will be used to
minimize the modelling criterion specified in `method`. |

`weights` |
an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight. |

`rel.gvar` |
relative geometric variability (real between 0 and 1). Take the
lowest effective rank with a relative geometric variability higher
than or equal to `rel.gvar`. |

`eff.rank` |
integer between 1 and the number of observations minus one.
Number of Euclidean coordinates used for model fitting. Applies only
if `method="eff.rank"`. |

`...` |
arguments passed to or from other methods to the low level. |

### Details

The `dblm` model uses the distance matrix between individuals
to find an appropriate prediction method.
There are many ways to compute this matrix besides
the three metrics available through the `metric` argument.
Several R packages also address this problem, in particular
`dist` in the package `stats` and `daisy` in the package `cluster`
(the three metrics in `dblm` call the `daisy` function).

Another way to enter a distance matrix to the model is through an object
of class `"D2"` (containing the squared distances matrix).
An object of class `"dist"` or `"dissimilarity"` can
easily be transformed into one of class `"D2"`. See `disttoD2`.
Reciprocally, an object of class `"D2"` can be transformed into one
of class `"dist"`. See `D2toDist`.
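As a rough illustration of how the two representations relate, the following base-R sketch builds a matrix of squared distances from a `dist` object and converts it back. It deliberately does not call `disttoD2` or `D2toDist`; it only mimics what such a conversion involves. The variable names are illustrative, and the result is a plain matrix, not an object of class `"D2"`.

```r
# Base-R sketch only; not the dbstats implementation.
set.seed(1)
Z <- matrix(rnorm(12), nrow = 4)     # 4 individuals, 3 observed variables

D <- dist(Z)                         # object of class "dist"
D2_mat <- as.matrix(D)^2             # full n x n matrix of squared distances

# Going back: take square roots and rebuild the "dist" object
D_back <- as.dist(sqrt(D2_mat))

# Largest discrepancy between the original and recovered distances
max(abs(as.vector(D) - as.vector(D_back)))
```

The squared matrix is symmetric with a zero diagonal, so the lower triangle stored in a `dist` object carries the same information.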

The S3 method for class `"Gram"` uses the doubly centered inner product matrix G=XX'.
An object of this class can also easily be transformed into one of class `"D2"`, and vice versa.
See `D2toG` and `GtoD2`.
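The link between squared distances and the doubly centered inner product matrix can be sketched in base R. This is an unweighted illustration under the assumption of Euclidean distances, not the code behind `D2toG` or `GtoD2`; the names `J`, `D2_mat` and `D2_back` are introduced here for the example only.

```r
# Unweighted base-R sketch; assumes Euclidean distances.
set.seed(2)
Z <- matrix(rnorm(15), nrow = 5)     # 5 individuals, 3 observed variables
n <- nrow(Z)
D2_mat <- as.matrix(dist(Z))^2       # squared distances

# Double centering: G = -1/2 * J D2 J, with J = I - (1/n) 1 1'
J <- diag(n) - matrix(1 / n, n, n)
G <- -0.5 * J %*% D2_mat %*% J

# For Euclidean distances, G equals X X' with X the column-centered Z
X <- scale(Z, center = TRUE, scale = FALSE)
max(abs(G - X %*% t(X)))

# Back from G to squared distances: d2_ij = g_ii + g_jj - 2 * g_ij
g <- diag(G)
D2_back <- outer(g, g, "+") - 2 * G
max(abs(D2_back - D2_mat))
```

Both printed discrepancies should be at the level of floating-point rounding.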

The weights array is adequate when responses for different individuals have different variances. In this case the weights array should be (proportional to) the reciprocal of the variances vector.
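A minimal sketch of this weighting scheme follows, assuming the response variances are known. The variances `sigma2` are simulated for the example and are hypothetical. The call to `dblm` is guarded so that the snippet also runs where dbstats is not installed.

```r
set.seed(3)
n <- 50
Z <- matrix(rnorm(n * 2), nrow = n)
sigma2 <- runif(n, 0.5, 2)           # hypothetical known variances
y <- drop(Z %*% c(1, -1)) + rnorm(n, sd = sqrt(sigma2))

# Weights proportional to the reciprocal of the variances
w <- 1 / sigma2

# Fit only if the package is available
if (requireNamespace("dbstats", quietly = TRUE)) {
  fit_w <- dbstats::dblm(dist(Z), y, weights = w)
}
```

Individuals with noisier responses receive smaller weights, so they influence the fit less.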

When using `method="eff.rank"` or `method="rel.gvar"`,
a compromise between possible consequences of a bad choice has to be
reached. If the rank is too large, the model can be overfitted, possibly
leading to an increased prediction error for new cases
(even though R2 is higher). On the other hand, a rank that is too small
may leave the model inadequate (R2 is small). The other four methods are less error
prone (but they still do not guarantee good predictions).
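To make the trade-off concrete, here is a base-R sketch of the rule described for `rel.gvar`: take the lowest rank whose relative geometric variability reaches a threshold. It assumes, for illustration, that the cumulative share of the leading eigenvalues of the unweighted matrix G plays the role of the relative geometric variability. The package's internal computation may differ, and all names below are hypothetical.

```r
# Illustrative sketch; not the dbstats internals.
set.seed(4)
n <- 30
# Six observed variables with very different scales
Z <- matrix(rnorm(n * 6), nrow = n) %*% diag(c(5, 3, 1, 0.5, 0.2, 0.1))

# Doubly centered inner product matrix from squared Euclidean distances
J <- diag(n) - matrix(1 / n, n, n)
G <- -0.5 * J %*% (as.matrix(dist(Z))^2) %*% J

# Keep the numerically positive eigenvalues
ev <- eigen(G, symmetric = TRUE)$values
ev <- ev[ev > 1e-8 * max(ev)]

# Cumulative share of variability explained by the first r coordinates
share <- cumsum(ev) / sum(ev)

rel_gvar <- 0.95
eff_rank <- which(share >= rel_gvar)[1]   # lowest rank reaching the threshold
eff_rank
```

Raising the threshold towards 1 pushes the rank up and the fit towards overfitting; lowering it discards coordinates that may carry signal.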

### Value

A list of class `dblm` containing the following components:

`residuals` |
the residuals (response minus fitted values). |

`fitted.values` |
the fitted mean values. |

`df.residuals` |
the residual degrees of freedom. |

`weights` |
the specified weights. |

`y` |
the response used to fit the model. |

`H` |
the hat matrix projector. |

`call` |
the matched call. |

`rel.gvar` |
the relative geometric variability used to fit the model. |

`eff.rank` |
the dimensions chosen to estimate the model. |

`ocv` |
the ordinary cross-validation estimate of the prediction error. |

`gcv` |
the generalized cross-validation estimate of the prediction error. |

`aic` |
the value of the Akaike Information Criterion for the model (only if `method="AIC"`). |

`bic` |
the value of the Bayesian Information Criterion for the model (only if `method="BIC"`). |

### Note

When the Euclidean distance is used the `dblm` model reduces to the linear
model (`lm`).
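The reason can be sketched with base R alone: Euclidean coordinates recovered from Euclidean distances span the same space as the centered observed variables, so regressing on either gives the same fitted values. The sketch uses `cmdscale` from `stats` to obtain the coordinates and keeps all `p` of them; it illustrates the idea and is not the `dblm` fitting code.

```r
set.seed(5)
n <- 40
p <- 3
Z <- matrix(rnorm(n * p), nrow = n)                # observed variables
y <- drop(Z %*% c(2, -1, 0.5)) + rnorm(n)

# Euclidean coordinates recovered from the distances alone
X <- cmdscale(dist(Z), k = p)

fit_X <- lm(y ~ X)    # regression on the Euclidean coordinates
fit_Z <- lm(y ~ Z)    # ordinary linear model on the observed variables

# Largest absolute difference between the two sets of fitted values
max(abs(fitted(fit_X) - fitted(fit_Z)))
```

The difference should be at the level of floating-point rounding. Keeping fewer than `p` coordinates would break the equivalence.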

### Author(s)

Boj, Eva <evaboj@ub.edu>, Caballe, Adria <adria.caballe@upc.edu>, Delicado, Pedro <pedro.delicado@upc.edu> and Fortiana, Josep <fortiana@ub.edu>

### References

Boj E, Caballe A, Delicado P, Esteve A, Fortiana J (2016). *Global and local distance-based generalized linear models*.
TEST 25, 170-195.

Boj E, Delicado P, Fortiana J (2010). *Distance-based local linear regression for functional predictors*.
Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). *Selection of predictors in distance-based regression*.
Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). *Some computational aspects of a distance-based model
for prediction*. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). *A distance-based regression model for prediction with mixed data*.
Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). *Distance analysis in discrimination and classification using both
continuous and categorical variables*. In: Y. Dodge (ed.), *Statistical Data Analysis and Inference*.
Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

### See Also

`summary.dblm` for summary.

`plot.dblm` for plots.

`predict.dblm` for predictions.

`ldblm` for distance-based local linear models.

### Examples

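```
# Simple example illustrating the use of dblm
library(dbstats)
set.seed(1)                          # reproducible simulation
n <- 100                             # number of individuals
p <- 3                               # number of explanatory variables
k <- 5                               # scale of the true coefficients
Z <- matrix(rnorm(n * p), nrow = n)
b <- matrix(runif(p) * k, nrow = p)
s <- 1                               # error standard deviation
e <- rnorm(n) * s
y <- Z %*% b + e
D <- dist(Z)                         # Euclidean distances between individuals
dblm1 <- dblm(D, y)
lm1 <- lm(y ~ Z)
# with Euclidean distances the fitted values should agree with those of lm;
# the mean difference is zero for any two fits with an intercept, so check the maximum
max(abs(lm1$fitted.values - dblm1$fitted.values))
```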

Package *dbstats* version 2.0.2