R: Auxiliary Nonlinear Variance Model

anlvm.fit {skedastic}

R Documentation

Auxiliary Nonlinear Variance Model

Description

Fits an Auxiliary Nonlinear Variance Model (ANLVM) to estimate the error variances of a heteroskedastic linear regression model.

Usage

anlvm.fit(
  mainlm,
  g,
  M = NULL,
  cluster = FALSE,
  varselect = c("none", "hettest", "cv.linear", "cv.cluster", "qgcv.linear",
    "qgcv.cluster"),
  nclust = c("elbow.swd", "elbow.mwd", "elbow.both"),
  clustering = NULL,
  param.init = function(q) stats::runif(n = q, min = -5, max = 5),
  maxgridrows = 20L,
  nconvstop = 3L,
  zerosallowed = FALSE,
  maxitql = 100L,
  tolql = 1e-08,
  nestedql = FALSE,
  reduce2homosked = TRUE,
  cvoption = c("testsetols", "partitionres"),
  nfolds = 5L,
  ...
)

Arguments

`mainlm`	Either an object of `class` `"lm"` (e.g., generated by `lm`), or a list of two objects: a response vector and a design matrix. The objects are assumed to be in that order, unless they are given the names `"X"` and `"y"` to distinguish them. The design matrix passed in a list must begin with a column of ones if an intercept is to be included in the linear model. The design matrix passed in a list should not contain factors, as all columns are treated 'as is'. For tests that use ordinary least squares residuals, one can also pass a vector of residuals in the list, which should either be the third object or be named `"e"`.
`g`	A numeric-valued function of one variable, or a character denoting the name of such a function. `"sq"` is allowed as a way of denoting `function(x) x ^ 2`.
`M`	An `n\times n` annihilator matrix. If `NULL` (the default), this will be calculated from the `mainlm` object
`cluster`	A logical; should the design matrix X be replaced with an `n\times n_c` matrix of ones and zeroes, with a single one in each row, indicating assignments of the `n` observations to `n_c` clusters using an agglomerative hierarchical clustering algorithm. In this case, the dimensionality of `\gamma` is `n_c` and not `p`. Defaults to `FALSE`
`varselect`	Either a character indicating how variable selection should be conducted, or an integer vector giving indices of columns of the predictor matrix (`model.matrix` of `mainlm`) to select. The vector must include `1L` for the intercept to be selected. If a character, it must be one of the following: `"none"`: No variable selection is conducted; `"hettest"`: Variable selection is conducted by applying a heteroskedasticity test with each feature in turn serving as the ‘deflator’ variable `"cv.linear"`: Variable selection is conducted by best subset selection on the auxiliary linear variance model (linear specification), using squared-error loss computed under `K`-fold cross-validation `"cv.cluster"`: Variable selection is conducted by best subset selection on the auxiliary linear variance model (clustering specification), using squared-error loss computed under `K`-fold cross-validation `"qgcv.linear"`: Variable selection is conducted by best subset selection on the auxiliary linear variance model (linear specification), using squared-error loss computed under quasi-generalised cross-validation `"qgcv.cluster"`: Variable selection is conducted by best subset selection on the auxiliary linear variance model (clustering specification), using squared-error loss computed under quasi-generalised cross-validation
`nclust`	A character indicating which elbow method to use to select the number of clusters (ignored if `cluster` is `FALSE`). Alternatively, an integer specifying the number of clusters
`clustering`	A list object of class `"doclust"`. If set to `NULL` (the default), such an object is generated (ignored if `cluster` is `FALSE`)
`param.init`	Specifies the initial values of the parameter vector to use in the Gauss-Newton fitting algorithm. This can either be a function for generating the initial values from a probability distribution, a list containing named objects corresponding to the arguments of `seq` (specifying a sequence of scalar values that will be passed to `expand.grid`), or a numeric vector specifying a single initial parameter vector
`maxgridrows`	An integer indicating the maximum number of initial values of the parameter vector to try, in case of `param.init` being a function or a list used to generate a grid. Defaults to `20L`.
`nconvstop`	An integer indicating how many times the quasi-likelihood estimation algorithm should converge before the grid search across different initial parameter values is truncated. Defaults to `3L`. If `nconvstop >= maxgridrows`, no early stopping rule will be used.
`zerosallowed`	A logical indicating whether 0 values are acceptable in the initial values of the parameter vector. Defaults to `FALSE`.
`maxitql`	An integer specifying the maximum number of iterations to run in the Gauss-Newton algorithm for quasi-likelihood estimation. Defaults to `100L`.
`tolql`	A double specifying the convergence criterion for the Gauss-Newton algorithm; defaults to `1e-8`. The criterion is applied to the `L_2` norm of the difference between parameter vectors in successive iterations.
`nestedql`	A logical indicating whether to use the nested updating step suggested in Seber and Wild (2003). Defaults to `FALSE` due to the large computation time required.
`reduce2homosked`	A logical indicating whether the homoskedastic error variance estimator `e'e/(n-p)` should be used if the variable selection procedure does not select any variables. Defaults to `TRUE`.
`cvoption`	A character, either `"testsetols"` or `"partitionres"`, indicating how to obtain the observed response values for each test fold when performing `K`-fold cross-validation on an ALVM. The default technique, `"testsetols"`, entails fitting a linear regression model to the test fold of observations from the original response vector `y` and predictor matrix `X`. The squared residuals from this regression are the observed responses that are predicted from the trained model to compute the cross-validated squared error loss function. Under the other technique, `"partitionres"`, the squared residuals from the full linear regression model are partitioned into training and test folds and the squared residuals in the test fold are the observed responses that are predicted for computation of the cross-validated loss.
`nfolds`	An integer specifying the number of folds `K` to use for cross-validation, if the `\lambda` and/or `n_c` hyperparameters are to be tuned using cross-validation. Defaults to `5L`. One must ensure that each test fold contains at least `p+1` observations if the `"testsetols"` technique is used with cross-validation, so that there are enough degrees of freedom to fit a linear model to the test fold.
`...`	Other arguments that can be passed to (non-exported) helper functions, namely: `greedy`, a logical passed to the functions implementing best subset selection, indicating whether or not to use a greedy search rather than exhaustive search for the best subset. Defaults to `FALSE`, but coerced to `TRUE` unconditionally if `p>9`. `distmetric`, a character specifying the distance metric to use in computing distance for the clustering algorithm. Corresponds to the `method` argument of `dist` and defaults to `"euclidean"` `linkage`, a character specifying the linkage rule to use in agglomerative hierarchical clustering. Corresponds to the `method` argument of `hclust` and defaults to `"complete"` `alpha`, a double specifying the significance level threshold to use when applying heteroskedasticity test for the purpose of feature selection in an ALVM; defaults to `0.1` `testname`, a character corresponding to the name of a function that performs a heteroskedasticity test. The function must either be one that takes a `deflator` argument or `breusch_pagan`. Defaults to `evans_king`

Details

The ANLVM model equation is

e_i^2=\displaystyle\sum_{k=1}^{n} g(X_{k\cdot}'\gamma) m_{ik}^2+u_i

, where e_i is the ith Ordinary Least Squares residual, X_{k\cdot} is a vector corresponding to the kth row of the n\times p design matrix X, m_{ik}^2 is the (i,k)th element of the annihilator matrix M=I-X(X'X)^{-1}X', u_i is a random error term, \gamma is a p-vector of unknown parameters, and g(\cdot) is a continuous, differentiable function that need not be linear in \gamma, but must be expressible as a function of the linear predictor X_{k\cdot}'\gamma. This method has been developed as part of the author's doctoral research project.

The parameter vector \gamma is estimated using the maximum quasi-likelihood method as described in section 2.3 of Seber and Wild (2003). The optimisation problem is solved numerically using a Gauss-Newton algorithm.

For further discussion of feature selection and the methods for choosing the number of clusters to use with the clustering version of the model, see alvm.fit.

Value

An object of class "anlvm.fit", containing the following:

coef.est, a vector of parameter estimates, \hat{\gamma}
var.est, a vector of estimates \hat{\omega} of the error variances for all observations
method, either "cluster" or "functionalform", depending on whether cluster was set to TRUE
ols, the lm object corresponding to the original linear regression model
fitinfo, a list containing three named objects, g (the heteroskedastic function), Msq (the elementwise-square of the annihilator matrix M), Z (the design matrix used in the ANLVM, after feature selection if applicable), and clustering (a list object with results of the clustering procedure, if applicable).
selectinfo, a list containing two named objects, varselect (the value of the eponymous argument), and selectedcols (a numeric vector with column indices of X that were selected, with 1 denoting the intercept column)
qlinfo, a list containing nine named objects: converged (a logical, indicating whether the Gauss-Newton algorithm converged for at least one initial value of the parameter vector), iterations (the number of Gauss-Newton iterations used to obtain the parameter estimates returned), Smin (the minimum achieved value of the objective function used in the Gauss-Newton routine), and six arguments passed to the function (nested, param.init, maxgridrows, nconvstop, maxitql, and tolql)

References

Seber GAF, Wild CJ (2003). Nonlinear Regression. Wiley, Hoboken, NJ.

Examples

mtcars_lm <- lm(mpg ~ wt + qsec + am, data = mtcars)
myanlvm <- anlvm.fit(mtcars_lm, g = function(x) x ^ 2,
 varselect = "qgcv.linear")

[Package skedastic version 2.0.2 Index]