polywog {polywog}    R Documentation
Polynomial regression with oracle variable selection
Description
Fits a regression model using a polynomial basis expansion of the input variables, with penalization via the adaptive LASSO or SCAD to provide oracle variable selection.
Usage
polywog(formula, data, subset, weights, na.action, degree = 3,
  family = c("gaussian", "binomial"), method = c("alasso", "scad"),
  penwt.method = c("lm", "glm"), unpenalized = character(0),
  .parallel = FALSE, boot = 0,
  control.boot = control.bp(.parallel = .parallel),
  lambda = NULL, nlambda = 100, lambda.min.ratio = 1e-04,
  nfolds = 10, foldid = NULL,
  thresh = ifelse(method == "alasso", 1e-07, 0.001),
  maxit = ifelse(method == "alasso", 1e+05, 5000),
  model = TRUE, X = FALSE, y = FALSE)
Arguments
formula
model formula specifying the response and input variables. See "Details" for more information.

data
a data frame, list or environment containing the variables specified in the model formula.

subset
an optional vector specifying a subset of observations to be used in fitting.

weights
an optional vector specifying weights for each observation to be used in fitting.

na.action
a function specifying what to do with observations containing NAs (default na.omit).

degree
integer specifying the degree of the polynomial expansion of the input variables.

family
"gaussian" (default) or "binomial" for a binary response fit via penalized logistic regression.

method
variable selection method: "alasso" (default) for the adaptive LASSO or "scad" for SCAD.

penwt.method
estimator for obtaining first-stage estimates in logistic models when family = "binomial": "lm" (default) for a linear probability model, "glm" for logistic regression.

unpenalized
names of model terms to be exempt from the adaptive penalty (only available when method = "alasso").

.parallel
logical: whether to perform k-fold cross-validation in parallel (only available when method = "alasso"). See "Details" below on parallel computation.

boot
number of bootstrap iterations (0 for no bootstrapping).

control.boot
list of arguments to be passed to bootPolywog when bootstrapping; see control.bp.

lambda
a vector of values from which the penalty factor is to be selected via k-fold cross-validation. If NULL (default), a sequence of candidate values is generated automatically (see nlambda and lambda.min.ratio).

nlambda
number of values of the penalty factor to examine via cross-validation if lambda is not specified.

lambda.min.ratio
ratio of the lowest value to the highest in the generated sequence of values of the penalty factor if lambda is not specified.

nfolds
number of folds to use in cross-validation to select the penalization factor.

foldid
optional vector manually assigning fold numbers to each observation used for fitting (only available when method = "alasso").

thresh
convergence threshold, passed as the thresh argument to glmnet (adaptive LASSO) or the eps argument to ncvreg (SCAD).

maxit
maximum number of iterations to allow in adaptive LASSO or SCAD fitting.

model
logical: whether to include the model frame in the returned object.

X
logical: whether to include the raw design matrix (i.e., the matrix of input variables prior to taking their polynomial expansion) in the returned object.

y
logical: whether to include the response variable in the returned object.
Details
The design matrix for the regression is a polynomial basis expansion of the matrix of raw input variables. This includes all powers and interactions of the input variables up to the specified degree. For example, the following terms will be included in polywog(y ~ x1 + x2, degree = 3, ...):

terms of degree 0: intercept
terms of degree 1: x1, x2
terms of degree 2: x1^2, x2^2, x1*x2
terms of degree 3: x1^3, x2^3, x1*x2^2, x1^2*x2
To exclude certain terms from the basis expansion, use a model formula like y ~ x1 + x2 | z1 + z2. Only the degree 1 terms of z1 and z2 will be included.
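As a brief illustration of this restriction (a sketch using the Prestige data from the Examples section; the object name fitRestricted is hypothetical):

## Sketch: education and income get the full degree-3 expansion, while
## type (after the "|") contributes only its degree 1 terms
data(Prestige, package = "carData")
fitRestricted <- polywog(prestige ~ education + income | type,
                         data = Prestige, degree = 3)
fitRestricted$polyTerms  # rows = expanded terms, columns = raw inputs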
It is possible that the "raw" basis expansion will be rank-deficient, for example when there are binary input variables (in which case x_i = x_i^n for all n > 0). The procedure detects collinearity via qr and removes extraneous columns before fitting.
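A small sketch with simulated data (all names here are hypothetical) illustrates the effect: the higher powers of a binary variable duplicate its degree 1 term and are removed by the qr-based check.

## Sketch: b is binary, so b^2 and b^3 are identical to b and get dropped
set.seed(1)
d <- data.frame(y = rnorm(100), x = rnorm(100), b = rbinom(100, 1, 0.5))
fitBin <- polywog(y ~ x + b, data = d, degree = 3)
fitBin$polyTerms  # inspect which expanded terms were retained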
For both the adaptive LASSO and SCAD, the penalization factor lambda is chosen by k-fold cross-validation. The selected value minimizes the average mean squared error of out-of-sample fits. (To select both lambda and the polynomial degree simultaneously via cross-validation, see cv.polywog.)
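For example, the candidate sequence and the number of folds can be adjusted through the arguments listed in Usage (a sketch; the object name fitCV is hypothetical):

## Sketch: a shorter candidate sequence and fewer folds for the CV search
data(Prestige, package = "carData")
fitCV <- polywog(prestige ~ education + income + type,
                 data = Prestige, degree = 2,
                 nfolds = 5, nlambda = 50, lambda.min.ratio = 1e-03)
fitCV$lambda                  # penalty factor selected by cross-validation
head(fitCV$lambda.cv$cvError) # out-of-fold error for the candidate values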
The cross-validation process may be run in parallel via foreach by registering an appropriate backend and specifying .parallel = TRUE. The appropriate backend is system-specific; see foreach for information on selecting and registering a backend. The bootstrap iterations may also be run in parallel by specifying control.boot = control.bp(.parallel = TRUE).
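A minimal sketch of such a setup, assuming the doParallel package is installed and used as the foreach backend (any registered backend would work):

## Sketch: register a parallel backend, then parallelize both the
## cross-validation and the bootstrap
library(doParallel)
registerDoParallel(cores = 2)
data(Prestige, package = "carData")
fitPar <- polywog(prestige ~ education + income + type,
                  data = Prestige, degree = 2, boot = 10,
                  .parallel = TRUE,
                  control.boot = control.bp(.parallel = TRUE))
stopImplicitCluster()  # release the implicit doParallel cluster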
Value
An object of class "polywog", a list containing:

coefficients
the estimated coefficients.

lambda
value of the penalty factor lambda used to fit the final model.

lambda.cv
a list containing the results of the cross-validation procedure used to select the penalty factor:
  lambda: values of the penalty factor tested in cross-validation.
  cvError: out-of-fold prediction error corresponding to each value of lambda.
  lambdaMin: value of lambda with the minimal cross-validation error.
  errorMin: minimized value of the cross-validation error.

fitted.values
the fitted mean values for each observation used in fitting.

lmcoef
coefficients from an unpenalized least-squares regression of the response variable on the polynomial expansion of the input variables.

penwt
adaptive weight given to each term in the LASSO penalty (NULL for models fit via SCAD).

formula
model formula, as a "Formula" object.

degree
degree of the polynomial basis expansion.

family
model family, "gaussian" or "binomial".

weights
observation weights, if specified.

method
the specified regularization method.

penwt.method
the specified method for calculating the adaptive LASSO weights (NULL for models fit via SCAD).

unpenalized
logical vector indicating which terms were not included in the LASSO penalty.

thresh
convergence threshold used in fitting.

maxit
iteration limit used in fitting.

terms
the terms object used to construct the model frame.

polyTerms
a matrix indicating the power of each raw input term (columns) in each term of the polynomial expansion used in fitting (rows).

nobs
the number of observations used to fit the model.

na.action
information on how NA values in the input data were handled.

xlevels
levels of factor variables used in fitting.

varNames
names of the raw input variables included in the model formula.

call
the original function call.

model
if model = TRUE, the model frame used in fitting; otherwise NULL.

X
if X = TRUE, the raw model matrix (i.e., prior to taking the polynomial expansion); otherwise NULL. For calculating the expanded model matrix, see model.matrix.polywog.

y
if y = TRUE, the response variable used in fitting; otherwise NULL.

boot.matrix
if boot > 0, a sparse matrix of class "dgCMatrix" where each column is the estimate from a bootstrap replicate. See bootPolywog for more information on bootstrapping.
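Components of the returned list can be extracted with the usual $ operator; the sketch below assumes the fit1 object created in the Examples section.

## Sketch: inspecting a fitted "polywog" object (fit1 from the Examples)
fit1$lambda            # penalty factor used for the final model
head(fit1$polyTerms)   # powers of each raw input in each expanded term
dim(fit1$boot.matrix)  # bootstrap coefficient estimates (only if boot > 0)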
Author(s)
Brenton Kenkel and Curtis S. Signorino
References
Brenton Kenkel and Curtis S. Signorino. 2012. "A Method for Flexible Functional Form Estimation: Bootstrapped Basis Regression with Variable Selection." Typescript, University of Rochester.
See Also
To estimate variation via the bootstrap, see bootPolywog. To generate fitted values, see predVals (and the underlying method predict.polywog). For plots, see plot.polywog.

The polynomial degree may be selected via cross-validation using cv.polywog.

Adaptive LASSO estimates are obtained via glmnet and cv.glmnet from the glmnet package. SCAD estimates are obtained via ncvreg and cv.ncvreg from the ncvreg package.
Examples
## Using occupational prestige data
data(Prestige, package = "carData")
Prestige <- transform(Prestige, income = income / 1000)
## Fit a polywog model with bootstrap iterations
## (note: using a loose convergence threshold to shorten the computation time
## of the example; *not* recommended in practice!)
set.seed(22)
fit1 <- polywog(prestige ~ education + income + type,
data = Prestige,
degree = 2,
boot = 5,
thresh = 1e-4)
## Basic information
print(fit1)
summary(fit1)
## See how fitted values change with education holding all else fixed
predVals(fit1, "education", n = 10)
## Plot univariate relationships
plot(fit1)
## Use SCAD instead of adaptive LASSO
fit2 <- update(fit1, method = "scad", thresh = 1e-3)
cbind(coef(fit1), coef(fit2))