R: Local distance-based generalized linear model

ldbglm {dbstats}

R Documentation

Local distance-based generalized linear model

Description

ldbglm is a localized version of a distance-based generalized linear model. As in the global model dbglm, explanatory information is coded as distances between individuals.

The neighborhood used for localizing is defined by the (semi)metric dist1, whereas a second (semi)metric dist2 (which may coincide with dist1) is used for distance-based prediction. Both dist1 and dist2 can either be computed from observed explanatory variables or input directly as a squared distances matrix or as a Gram matrix. Response and link function are as in the dbglm function for ordinary generalized linear models. The model allows for a mixture of continuous and qualitative explanatory variables or, in fact, more general quantities such as functional data.

Notation convention: in distance-based methods we must distinguish observed explanatory variables, which we denote by Z or z, from Euclidean coordinates, which we denote by X or x. For an explanation of both terms, see the References below.
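The relationship between observed variables Z, a squared distances matrix, a Gram matrix and Euclidean coordinates X can be sketched in base R. This is only an illustration with uniform observation weights: the object names are made up for the example, and the internal computations of dbstats (which may, for instance, use weighted centering) are not reproduced here.

```r
# Illustrative sketch in base R; none of these names are dbstats functions.
set.seed(1)
Z <- matrix(rnorm(20), nrow = 10)      # observed explanatory variables (10 x 2)
n <- nrow(Z)

D2 <- as.matrix(dist(Z))^2             # squared Euclidean distances between rows of Z

# Doubly centered inner-product (Gram) matrix: G = -1/2 * J %*% D2 %*% J
J <- diag(n) - matrix(1 / n, n, n)     # centering matrix (uniform weights assumed)
G <- -0.5 * J %*% D2 %*% J

# Euclidean coordinates X from the spectral decomposition of G
eig  <- eigen(G, symmetric = TRUE)
keep <- eig$values > 1e-8              # drop numerically null directions
X <- eig$vectors[, keep, drop = FALSE] %*% diag(sqrt(eig$values[keep]), sum(keep))

# X reproduces the original squared distances up to rounding error
max(abs(as.matrix(dist(X))^2 - D2))
```

Here X differs from Z by a centering and a rotation, which is why distance-based methods can work with X without ever observing Z.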

Usage


## S3 method for class 'formula'
ldbglm(formula,data,...,family=gaussian(),kind.of.kernel=1,
        metric1="euclidean",metric2=metric1,method.h="GCV",weights,
        user.h=NULL,h.range=NULL,noh=10,k.knn=3,
        rel.gvar=0.95,eff.rank=NULL,maxiter=100,eps1=1e-10,
        eps2=1e-10)

## S3 method for class 'dist'
ldbglm(dist1,dist2=dist1,y,family=gaussian(),kind.of.kernel=1,
        method.h="GCV",weights,user.h=quantile(dist1,.25),
        h.range=quantile(as.matrix(dist1),c(.05,.5)),noh=10,k.knn=3,
        rel.gvar=0.95,eff.rank=NULL,maxiter=100,eps1=1e-10,eps2=1e-10,...)

## S3 method for class 'D2'
ldbglm(D2.1,D2.2=D2.1,y,family=gaussian(),kind.of.kernel=1,
        method.h="GCV",weights,user.h=quantile(D2.1,.25)^.5,
        h.range=quantile(as.matrix(D2.1),c(.05,.5))^.5,noh=10,
        k.knn=3,rel.gvar=0.95,eff.rank=NULL,maxiter=100,eps1=1e-10,
        eps2=1e-10,...) 

## S3 method for class 'Gram'
ldbglm(G1,G2=G1,y,kind.of.kernel=1,user.h=NULL,
        family=gaussian(),method.h="GCV",weights,h.range=NULL,noh=10,
        k.knn=3,rel.gvar=0.95,eff.rank=NULL,maxiter=100,eps1=1e-10,
        eps2=1e-10,...)

Arguments

`formula`	an object of class `formula`. A formula of the form `y~Z`. This argument is a remnant of the `loess` function, kept for compatibility.
`data`	an optional data frame containing the variables in the model (both response and explanatory variables, either the observed ones, Z, or a Euclidean configuration X).
`y`	(required if no formula is given as the principal argument). Response (dependent variable); must be numeric, a matrix or a data.frame.
`dist1`	a `dist` or `dissimilarity` class object. Distances between observations, used to define the neighborhoods for localizing. Weights for observations are computed as a decreasing function of their `dist1` distances to the neighborhood center, e.g. a new observation whose response has to be predicted. These weights are then passed to a `dbglm`, where distances are evaluated with `dist2`.
`dist2`	a `dist` or `dissimilarity` class object. Distances between observations, used for fitting `dbglm`. Default `dist2=dist1`.
`D2.1`	a `D2` class object. Squared distances matrix between individuals. One of the alternative ways of entering distance information to a function. See the Details section in `dblm`. See above `dist1` for explanation of its role in this function.
`D2.2`	a `D2` class object. Squared distances between observations. One of the alternative ways of entering distance information to a function. See the Details section in `dblm`. See above `dist2` for explanation of its role in this function. Default `D2.2=D2.1`.
`G1`	a `Gram` class object. Doubly centered inner product matrix associated with the squared distances matrix `D2.1`.
`G2`	a `Gram` class object. Doubly centered inner product matrix associated with the squared distances matrix `D2.2`. Default `G2=G1`.
`family`	a description of the error distribution and link function to be used in the model. This can be a character string naming a family function, a family function or the result of a call to a family function. (See `family` for details of family functions.)
`kind.of.kernel`	integer number between 1 and 6 which determines the user's choice of smoothing kernel. (1) Epanechnikov (Default), (2) Biweight, (3) Triweight, (4) Normal, (5) Triangular, (6) Uniform.
`metric1`	metric function to be used when computing `dist1` from observed explanatory variables. One of `"euclidean"` (default), `"manhattan"`, or `"gower"`.
`metric2`	metric function to be used when computing `dist2` from observed explanatory variables. One of `"euclidean"`, `"manhattan"`, or `"gower"`. Default `metric2=metric1`.
`method.h`	sets the method to be used in deciding the optimal bandwidth h. There are four different methods: `AIC`, `BIC`, `GCV` (default) and `user.h`. `GCV` takes the optimal bandwidth as the one minimizing a cross-validatory quantity. `AIC` and `BIC` take the optimal bandwidth as the one minimizing, respectively, the Akaike or Bayesian Information Criterion (see `AIC` for more details). When `method.h` is `user.h`, the bandwidth is explicitly set by the user through the `user.h` parameter, which is otherwise optional but becomes mandatory in this case.
`weights`	an optional numeric vector of weights to be used in the fitting process. By default all individuals have the same weight.
`user.h`	global bandwidth `user.h`, set by the user, controlling the size of the local neighborhood of Z. Smoothing parameter (Default: 1st quartile of all the distances d(i,j) in `dist1`). Applies only if `method.h="user.h"`.
`h.range`	a vector of length 2 giving the range for automatic bandwidth choice. (Default: quantiles 0.05 and 0.5 of d(i,j) in `dist1`).
`noh`	number of bandwidth `h` values within `h.range` for automatic bandwidth choice (if `method.h!="user.h"`).
`k.knn`	minimum number of observations with positive weight in each local neighborhood. This avoids runtime errors caused by a bandwidth so small that some neighborhoods contain only one observation. By default `k.knn=3`.
`rel.gvar`	relative geometric variability (a real number between 0 and 1). In each `dblm` iteration, take the lowest effective rank with a relative geometric variability greater than or equal to `rel.gvar`. The default value (`rel.gvar=0.95`) uses 95% of the total variability.
`eff.rank`	integer between 1 and the number of observations minus one. Number of Euclidean coordinates used for model fitting in each `dblm` iteration. If specified, its value overrides `rel.gvar`. When `eff.rank=NULL` (default), calls to `dblm` are made with `method="rel.gvar"`.
`maxiter`	maximum number of iterations in the iterated `dblm` algorithm. (Default = 100)
`eps1`	stopping criterion 1, `"DevStat"`: convergence tolerance `eps1`, a positive (small) number; the iterations converge when `|dev - dev_old| / |dev| < eps1`. Stationarity of deviance has been attained.
`eps2`	stopping criterion 2, `"mustat"`: convergence tolerance `eps2`, a positive (small) number; the iterations converge when `|mu - mu_old| / |mu| < eps2`. Stationarity of fitted.values `mu` has been attained.
`...`	arguments passed to or from other methods to the low level.
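
The two stopping rules controlled by `eps1` and `eps2` can be written as a small check. The helper below is hypothetical and not part of dbstats. It assumes a Euclidean norm for the vector `mu`, and it returns both flags separately because the documentation does not state how the package combines them.

```r
# Hypothetical helper mirroring the "DevStat" and "mustat" criteria described above.
has_converged <- function(dev, dev_old, mu, mu_old, eps1 = 1e-10, eps2 = 1e-10) {
  dev_stat <- abs(dev - dev_old) / abs(dev)                  # relative change in deviance
  mu_stat  <- sqrt(sum((mu - mu_old)^2)) / sqrt(sum(mu^2))   # relative change in fitted values
  c(DevStat = dev_stat < eps1, mustat = mu_stat < eps2)
}

# Nearly identical successive iterations: both criteria are met
has_converged(dev = 10, dev_old = 10 + 1e-12, mu = c(1, 2), mu_old = c(1, 2))

# Clearly different successive iterations: neither criterion is met
has_converged(dev = 10, dev_old = 11, mu = c(1, 2), mu_old = c(1, 2.5))
```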

Details

The various possible ways for inputting the model explanatory information through distances, or their squares, etc., are the same as in dblm.

The set of bandwidth h values checked in automatic bandwidth choice is defined by h.range and noh, together with k.knn. For each h in it a local generalized linear model is fitted and the optimal h is decided according to the statistic specified in method.h.
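
As a rough illustration of how `h.range`, `noh` and `k.knn` interact, the sketch below builds a candidate grid from the default range of the 'dist' method and counts neighbors for each candidate. Equal spacing of the grid and the "distance smaller than h" neighbor rule are assumptions made for the example; the exact construction used inside dbstats may differ.

```r
# Illustrative sketch in base R; the grid spacing and neighbor rule are assumptions.
set.seed(1)
z <- rnorm(50)
d <- dist(z)
D <- as.matrix(d)

# Default range in the 'dist' method: quantiles 0.05 and 0.5 of the distances
h.range <- quantile(D, c(0.05, 0.5), names = FALSE)
noh     <- 10
k.knn   <- 3

h_grid <- seq(h.range[1], h.range[2], length.out = noh)   # assumed equally spaced

# For each candidate h, the size of the smallest neighborhood
# (observations at distance < h from the center, the center itself included)
min_neigh <- sapply(h_grid, function(h) min(rowSums(D < h)))

data.frame(h = h_grid, min_neighbors = min_neigh, enough = min_neigh >= k.knn)
```

Candidates for which `enough` is `FALSE` are those where some neighborhood would fall below `k.knn` observations, which is the situation the `k.knn` argument is meant to guard against.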

kind.of.kernel designates which kernel function is to be used in determining individual weights from dist1 values. See density for more information.
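
The six kernel choices can be sketched with their usual textbook definitions. The function `kernel_weights` below is hypothetical and only illustrates how distances to a neighborhood center could be turned into weights for a bandwidth `h`. Any scaling or normalization applied inside dbstats is not documented here and is not reproduced.

```r
# Hypothetical helper: textbook kernels, numbered as in kind.of.kernel.
kernel_weights <- function(d, h, kind.of.kernel = 1) {
  u <- d / h
  inside <- as.numeric(abs(u) <= 1)        # compact support (all kernels except Normal)
  switch(kind.of.kernel,
    0.75 * (1 - u^2) * inside,             # (1) Epanechnikov
    (15 / 16) * (1 - u^2)^2 * inside,      # (2) Biweight
    (35 / 32) * (1 - u^2)^3 * inside,      # (3) Triweight
    dnorm(u),                              # (4) Normal
    (1 - abs(u)) * inside,                 # (5) Triangular
    0.5 * inside                           # (6) Uniform
  )
}

d <- c(0, 0.5, 1, 2)                       # distances to the neighborhood center
w <- kernel_weights(d, h = 1)              # Epanechnikov weights
w / sum(w)                                 # normalized to sum to one
```

With a compactly supported kernel, observations farther than `h` from the center receive zero weight, whereas the Normal kernel gives every observation a positive weight.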

For gamma distributions, the domain of the canonical link function is not the same as the permitted range of the mean. In particular, the linear predictor might be negative, yielding an impossible negative mean. Should that event occur, dbglm stops with an error message. The proposed alternative is to use a non-canonical link function.
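
The issue with the canonical gamma link can be seen directly from the family objects in base R, without fitting any model. The values of the linear predictor below are made up for the illustration.

```r
# Base R only: compare inverse links of the Gamma family on the same linear predictor.
eta <- c(-0.5, 0.2, 1.5)                           # linear predictor, one negative value

mu_inverse <- Gamma(link = "inverse")$linkinv(eta) # canonical link: mu = 1 / eta
mu_log     <- Gamma(link = "log")$linkinv(eta)     # non-canonical link: mu = exp(eta)

mu_inverse   # first mean is negative, which is impossible for a gamma response
mu_log       # all means are positive
```

Since the `family` argument accepts the result of a call to a family function, a non-canonical choice such as `family = Gamma(link = "log")` is one way to follow the alternative suggested above.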

Value

A list of class ldbglm containing the following components:

`residuals`	the residuals (response minus fitted values).
`fitted.values`	the fitted mean values.
`h.opt`	the optimal bandwidth `h` used in the fitting process (if `method.h!="user.h"`).
`family`	the `family` object used.
`y`	the response variable used.
`S`	the Smoother hat projector.
`weights`	the specified weights.
`call`	the matched call.
`dist1`	the distance matrix (object of class `"D2"` or `"dist"`) used to calculate the weights of the observations.
`dist2`	the distance matrix (object of class `"D2"` or `"dist"`) used to fit the `dbglm`.

Objects of class "ldbglm" are actually of class c("ldbglm", "ldblm"), inheriting the plot.ldblm and summary.ldblm methods from class "ldblm".

Note

For each bandwidth, model fitting is repeated n times (n = number of observations), so automatic bandwidth choice requires noh*n local fits in total. If noh is large or the sample has many observations, the running time of this function can be very long.

Author(s)

Boj, Eva <evaboj@ub.edu>, Caballe, Adria <adria.caballe@upc.edu>, Delicado, Pedro <pedro.delicado@upc.edu> and Fortiana, Josep <fortiana@ub.edu>

References

Boj E, Caballe A, Delicado P, Esteve A, Fortiana J (2016). Global and local distance-based generalized linear models. TEST 25, 170-195.

Boj E, Delicado P, Fortiana J (2010). Distance-based local linear regression for functional predictors. Computational Statistics and Data Analysis 54, 429-437.

Boj E, Grane A, Fortiana J, Claramunt MM (2007). Selection of predictors in distance-based regression. Communications in Statistics B - Simulation and Computation 36, 87-98.

Cuadras CM, Arenas C, Fortiana J (1996). Some computational aspects of a distance-based model for prediction. Communications in Statistics B - Simulation and Computation 25, 593-609.

Cuadras C, Arenas C (1990). A distance-based regression model for prediction with mixed data. Communications in Statistics A - Theory and Methods 19, 2261-2279.

Cuadras CM (1989). Distance analysis in discrimination and classification using both continuous and categorical variables. In: Y. Dodge (ed.), Statistical Data Analysis and Inference. Amsterdam, The Netherlands: North-Holland Publishing Co., pp. 459-473.

Examples


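# example of ldbglm usage: simulated binary response with one explanatory variable
 z <- rnorm(100)
 y <- rbinom(100, 1, plogis(z))
 # squared distances matrix, tagged with class "D2" so that the 'D2' method is used
 D2 <- as.matrix(dist(z))^2
 class(D2) <- "D2"
 
 # Distance-based generalized linear model (global fit)
 dbglm2 <- dbglm(D2,y,family=binomial(link = "logit"), method="rel.gvar")
 # Local distance-based generalized linear model (noh=3 candidate bandwidths)
 ldbglm2 <- ldbglm(D2,y=y,family=binomial(link = "logit"),noh=3)
 
 # compare both fits: residual sums of squares
 sum((y-ldbglm2$fitted.values)^2)
 sum((y-dbglm2$fitted.values)^2)
 # observed response, local fit (col=3) and global fit (col=2)
 plot(z,y)
 points(z,ldbglm2$fitted.values,col=3)
 points(z,dbglm2$fitted.values,col=2)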

[Package dbstats version 2.0.2 Index]