crs {crs}   R Documentation
Categorical Regression Splines
Description
crs computes a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and categorical (factor/ordered) predictors (Ma and Racine (2013), Ma, Racine and Yang (2015)).
Usage
crs(...)
## Default S3 method:
crs(xz,
y,
degree = NULL,
segments = NULL,
include = NULL,
kernel = TRUE,
lambda = NULL,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
tau = NULL,
weights = NULL,
...)
## S3 method for class 'formula'
crs(formula,
data = list(),
degree = NULL,
segments = NULL,
include = NULL,
degree.max = 10,
segments.max = 10,
degree.min = 0,
segments.min = 1,
cv.df.min = 1,
cv = c("nomad","exhaustive","none"),
cv.threshold = 1000,
cv.func = c("cv.ls","cv.gcv","cv.aic"),
kernel = TRUE,
lambda = NULL,
lambda.discrete = FALSE,
lambda.discrete.num = 100,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
restarts = 0,
random.seed = 42,
max.bb.eval = 10000,
initial.mesh.size.real = "r1.0e-01",
initial.mesh.size.integer = "1",
min.mesh.size.real = paste(sqrt(.Machine$double.eps)),
min.mesh.size.integer = 1,
min.poll.size.real = 1,
min.poll.size.integer = 1,
    opts = list(),
nmulti = 5,
tau = NULL,
weights = NULL,
singular.ok = FALSE,
...)
Arguments
xz
numeric and/or categorical (factor/ordered) predictors.

y
a numeric vector of responses.

degree
integer/vector specifying the polynomial degree of the B-spline basis for each dimension of the continuous predictors.

segments
integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous predictors.

include
integer/vector specifying whether each of the nominal/ordinal (factor/ordered) predictors is included in or omitted from the regression.

lambda
a vector of bandwidths, one for each dimension of the categorical predictors.

lambda.discrete
if TRUE (default FALSE), the bandwidth lambda is restricted to a discrete grid of values rather than searched over continuously.

lambda.discrete.num
a positive integer (default 100) indicating the number of discrete values that lambda can assume; this parameter is only used when lambda.discrete=TRUE.
formula
a symbolic description of the model to be fit.

data
an optional data frame containing the variables in the model.

cv
a character string (default "nomad") specifying whether to use NOMAD-based numerical search ("nomad"), exhaustive search ("exhaustive"), or no cross-validation ("none").

cv.threshold
an integer (default 1000).

cv.func
a character string (default "cv.ls") specifying the cross-validation criterion: least-squares cross-validation ("cv.ls"), generalized cross-validation ("cv.gcv"; Craven and Wahba (1979)), or the AICc criterion ("cv.aic"; Hurvich, Simonoff and Tsai (1998)).

kernel
a logical value (default TRUE) specifying whether to use kernel smoothing for the categorical predictors (TRUE) or indicator basis functions (FALSE).

degree.max
the maximum degree of the B-spline basis for each of the continuous predictors (default 10).

segments.max
the maximum number of segments of the B-spline basis for each of the continuous predictors (default 10).

degree.min
the minimum degree of the B-spline basis for each of the continuous predictors (default 0).

segments.min
the minimum number of segments of the B-spline basis for each of the continuous predictors (default 1).

cv.df.min
the minimum degrees of freedom to allow when conducting NOMAD-based cross-validation (default 1).
complexity
a character string (default "degree-knots") specifying whether cross-validation selects both the spline degree and the number of knots ("degree-knots"), the degree only ("degree"), or the knots only ("knots").

knots
a character string (default "quantiles") specifying how knots are placed: at equally spaced quantiles of the data ("quantiles"), at equally spaced intervals ("uniform"), or with the placement method itself chosen by cross-validation ("auto").

basis
a character string (default "auto") specifying the basis: additive ("additive"), tensor product ("tensor"), generalized polynomial ("glp"), or selected by cross-validation ("auto").

deriv
an integer (default 0); when positive, derivatives of that order are computed for the continuous predictors (differences in levels for the categorical predictors).

data.return
a logical value (default FALSE) indicating whether to return the data along with the fitted object.

prune
a logical value (default FALSE) specifying whether to prune the fitted model via a stepwise cross-validation criterion (see Details; only available when kernel=FALSE).

model.return
a logical value (default FALSE) indicating whether to return the list of fitted lm models.
restarts
integer specifying the number of times to restart the process of finding extrema of the cross-validation function (for the bandwidths only) from different (random) initial points (default 0).

random.seed
when not missing and not equal to 0, the initial points for cv="nomad" will be generated using this seed (default 42).

max.bb.eval
argument passed to the NOMAD solver (see snomadr for further details; default 10000).

initial.mesh.size.real
argument passed to the NOMAD solver (see snomadr for further details).

initial.mesh.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

min.mesh.size.real
argument passed to the NOMAD solver (see snomadr for further details).

min.mesh.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

min.poll.size.real
argument passed to the NOMAD solver (see snomadr for further details).

min.poll.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

opts
list of optional arguments passed to the NOMAD solver (see snomadr for further details).
nmulti
integer number of times to restart the process of finding extrema of the cross-validation function from different (random) initial points (default 5).

tau
if non-null, a number in (0,1) denoting the quantile for which a quantile regression spline is to be estimated rather than estimating the conditional mean (default NULL).

weights
an optional vector of weights to be used in the fitting process; should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

singular.ok
a logical value (default FALSE); if FALSE, a singular basis matrix is treated as an error.

...
optional arguments.
Details
Typical usages are (see below for a list of options and also the examples at the end of this help file):

## Estimate the model and let the basis type be determined by
## cross-validation (i.e. cross-validation will determine whether to
## use the additive, generalized, or tensor product basis)

model <- crs(y~x1+x2)

## Estimate the model for a specified degree/segment/bandwidth
## combination and do not run cross-validation (will use the
## additive basis by default)

model <- crs(y~x1+factor(x2),cv="none",degree=3,segments=1,lambda=.1)

## Plot the mean and (asymptotic) error bounds

plot(model,mean=TRUE,ci=TRUE)

## Plot the first partial derivative and (asymptotic) error bounds

plot(model,deriv=1,ci=TRUE)
crs computes a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and categorical (factor/ordered) predictors.

The regression spline model employs the tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the tensor.prod.model.matrix function.
When basis="additive" the model is additive in nature (i.e. no interaction/tensor terms, hence semiparametric rather than fully nonparametric).

When basis="tensor" the model uses the multivariate tensor product basis.

When kernel=FALSE the model uses indicator basis functions for the nominal/ordinal (factor/ordered) predictors rather than kernel weighting.

When kernel=TRUE the product kernel function for the discrete predictors is of the ‘Li-Racine’ type (see Li and Racine (2007) for details).
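As an illustrative sketch of the basis options (not part of the original examples; the simulated data and variable names here are assumptions), the bases can be compared directly since a fitted crs object reports its coefficient of determination in the r.squared component:

## Sketch: compare additive and tensor product bases on data with an
## interaction between the continuous predictors
library(crs)
set.seed(123)
n <- 250
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1)*sin(2*pi*x2) + rnorm(n,sd=.25)
model.add <- crs(y~x1+x2,basis="additive")
model.ten <- crs(y~x1+x2,basis="tensor")
## The tensor product basis can capture the interaction that the
## additive basis cannot
model.add$r.squared
model.ten$r.squared

Because this data-generating process is not additive, the additive (semiparametric) fit will typically report the lower r.squared of the two.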
When cv="nomad", numerical search is undertaken using Nonsmooth Optimization by Mesh Adaptive Direct Search (Abramson, Audet, Couture, Dennis, Jr., and Le Digabel (2011)).

When kernel=TRUE and cv="exhaustive", numerical search is undertaken using optim and the box-constrained L-BFGS-B method (see optim for details). The user may restart the algorithm as many times as desired via the restarts argument (default restarts=0). The approach ascends from degree=0 (or segments=0) through degree.max and, for each value of degree (or segments), searches for the optimal bandwidths. After the most complex model has been searched, the optimal degree/segments/lambda combination is selected. If any element of the optimal degree (or segments) vector coincides with degree.max (or segments.max), a warning is produced and the user ought to restart the search with a larger value of degree.max (or segments.max).
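The degree.max advice above can also be acted on programmatically, since the selected degree vector is returned in the fitted object. A brief sketch (simulated data assumed, not part of the original examples):

## Sketch: NOMAD-based cross-validation with multiple restarts
library(crs)
set.seed(42)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1) + sin(2*pi*x2) + rnorm(n,sd=.5)
model <- crs(y~x1+x2,cv="nomad",nmulti=5)
## If any selected degree coincides with degree.max (default 10),
## restart the search with a larger ceiling as advised above
if(any(model$degree == 10))
  model <- crs(y~x1+x2,cv="nomad",nmulti=5,degree.max=15)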
Note that the default plot method for a crs object provides some diagnostic measures, in particular, a) residuals versus fitted values (useful for checking the assumption that E(u|x)=0), b) a normal quantile-quantile plot which allows residuals to be assessed for normality (qqnorm), c) a scale-location plot that is useful for checking the assumption that the errors are iid and, in particular, that the variance is homogeneous, and d) ‘Cook's distance’ which computes the single-case influence function. See below for other arguments for the plot function for a crs object.
Note that setting prune=TRUE produces a final ‘pruning’ of the model via a stepwise cross-validation criterion achieved by modifying stepAIC and replacing extractAIC with extractCV throughout the function. This option may be enabled to remove potentially superfluous bases, thereby improving the finite-sample efficiency of the resulting model. Note that if the cross-validation score for the pruned model is no better than that for the original model, then the original model is returned with a warning to this effect. Note also that this option can only be used when kernel=FALSE.
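Since pruning requires indicator bases, a minimal sketch looks as follows (simulated data assumed; not part of the original examples):

## Sketch: pruning superfluous bases (kernel=FALSE is required)
library(crs)
set.seed(42)
n <- 500
x <- runif(n)
z <- factor(rbinom(n,1,.5))
y <- cos(2*pi*x) + as.numeric(z==1)/2 + rnorm(n,sd=.5)
model <- crs(y~x+z,kernel=FALSE,prune=TRUE)
## If the pruned model's cross-validation score is no better than the
## original's, the original model is returned with a warning
summary(model)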
Value
crs returns a crs object. The generic functions fitted and residuals extract (or generate) estimated values and residuals. Furthermore, the functions summary, predict, and plot (options mean=FALSE, deriv=i where i is an integer, ci=FALSE, persp.rgl=FALSE, plot.behavior=c("plot","plot-data","data"), xtrim=0.0, xq=0.5) support objects of this type. The returned object has the following components:
fitted.values
estimates of the regression function (conditional mean) at the sample points or evaluation points.

lwr, upr
lower/upper bounds of a 95% confidence interval for the estimated conditional mean.

residuals
residuals computed at the sample points or evaluation points.

degree
integer/vector specifying the degree of the B-spline basis for each dimension of the continuous predictors.

segments
integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous predictors.

include
integer/vector specifying whether each of the nominal/ordinal (factor/ordered) predictors was included in the regression.

kernel
a logical value indicating whether kernel smoothing was used (TRUE) or indicator basis functions were used (FALSE).

lambda
vector of bandwidths used if kernel=TRUE.

call
a symbolic description of the model.

r.squared
coefficient of determination (Doksum and Samarov (1995)).

model.lm
an object of class lm holding the fitted model.

deriv.mat
a matrix of derivatives (or differences in levels for the categorical predictors).

deriv.mat.lwr
a matrix of 95% coverage lower bounds for deriv.mat.

deriv.mat.upr
a matrix of 95% coverage upper bounds for deriv.mat.

hatvalues
the hat values for the estimated model.

P.hat
the kernel probability estimates corresponding to the categorical predictors in the estimated model.
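The components above can be accessed directly from the fitted object; a brief sketch (simulated data assumed, not part of the original examples):

## Sketch: extracting components from a fitted crs object
library(crs)
set.seed(42)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1) + sin(2*pi*x2) + rnorm(n,sd=.5)
model <- crs(y~x1+x2)
head(fitted(model))        ## conditional mean estimates
head(residuals(model))     ## residuals at the sample points
model$r.squared            ## Doksum and Samarov (1995) R-squared
model$degree               ## cross-validated degree per predictor
model$segments             ## cross-validated segments per predictor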
Usage Issues
Note that when kernel=FALSE, summary supports the option sigtest=TRUE that conducts an F-test for significance for each predictor.
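A minimal sketch of the significance test (simulated data assumed; not part of the original examples):

## Sketch: predictor significance tests require kernel=FALSE
library(crs)
set.seed(42)
n <- 500
x <- runif(n)
z <- factor(rbinom(n,1,.5))
y <- cos(2*pi*x) + as.numeric(z==1)/2 + rnorm(n,sd=.5)
model <- crs(y~x+z,kernel=FALSE)
summary(model,sigtest=TRUE)  ## F-test for each predictor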
Author(s)
Jeffrey S. Racine racinej@mcmaster.ca
References
Abramson, M.A. and C. Audet and G. Couture and J.E. Dennis Jr. and S. Le Digabel (2011), “The NOMAD project”. Software available at https://www.gerad.ca/nomad.
Craven, P. and G. Wahba (1979), “Smoothing Noisy Data With Spline Functions,” Numerische Mathematik, 31, 377-403.
Doksum, K. and A. Samarov (1995), “Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression,” The Annals of Statistics, 23, 1443-1473.
Hurvich, C.M. and J.S. Simonoff and C.L. Tsai (1998), “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion,” Journal of the Royal Statistical Society B, 60, 271-293.
Le Digabel, S. (2011), “Algorithm 909: NOMAD: Nonlinear Optimization With The MADS Algorithm”. ACM Transactions on Mathematical Software, 37(4):44:1-44:15.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Ma, S. and J.S. Racine and L. Yang (2015), “Spline Regression in the Presence of Categorical Predictors,” Journal of Applied Econometrics, Volume 30, 705-717.
Ma, S. and J.S. Racine (2013), “Additive Regression Splines with Irrelevant Categorical and Continuous Regressors,” Statistica Sinica, Volume 23, 515-541.
Racine, J.S. (2011), “Cross-Validated Quantile Regression Splines,” manuscript.
See Also
Examples
set.seed(42)
## Example - simulated data
n <- 1000
num.eval <- 50
x1 <- runif(n)
x2 <- runif(n)
z <- rbinom(n,1,.5)
dgp <- cos(2*pi*x1)+sin(2*pi*x2)+z
z <- factor(z)
y <- dgp + rnorm(n,sd=.5)
## Estimate a model with specified degree, segments, and bandwidth
model <- crs(y~x1+x2+z,degree=c(5,5),
segments=c(1,1),
lambda=0.1,
cv="none",
kernel=TRUE)
summary(model)
## Perspective plot
x1.seq <- seq(min(x1),max(x1),length=num.eval)
x2.seq <- seq(min(x2),max(x2),length=num.eval)
x.grid <- expand.grid(x1.seq,x2.seq)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(0,num.eval^2),levels=c(0,1)))
z0 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(1,num.eval^2),levels=c(0,1)))
z1 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
zlim <- range(z0,z1)
persp(x=x1.seq,y=x2.seq,z=z0,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
ticktype="detailed",
border="red",
theta=45,phi=45)
par(new=TRUE)
persp(x=x1.seq,y=x2.seq,z=z1,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
theta=45,phi=45,
ticktype="detailed",
border="blue")
## Partial regression surface plot
plot(model,mean=TRUE,ci=TRUE)
## Not run:
## A plot example where we extract the partial surfaces, confidence
## intervals etc. automatically generated by plot(mean=TRUE,...) but do
## not plot, rather save for separate use.
pdat <- plot(model,mean=TRUE,ci=TRUE,plot.behavior="data")
## Column 1 is the (evaluation) predictor ([,1]), 2-4 ([,-1]) the mean,
## lwr, and upr (note the returned value is a 'list' hence pdat[[1]] is
## data for the first predictor, pdat[[2]] the second etc). Note that
## matplot() can plot this nicely.
matplot(pdat[[1]][,1],pdat[[1]][,-1],
xlab=names(pdat[[1]][1]),ylab=names(pdat[[1]][2]),
lty=c(1,2,2),col=c(1,2,2),type="l")
## End(Not run)
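The tau argument described above can be used to estimate conditional quantiles rather than the conditional mean; a brief sketch reusing y, x1, x2, and z from the simulated-data example (the specification mirrors the earlier cv="none" fit, which is an assumption made for speed):

## Quantile regression spline for the conditional median (tau=0.5)
model.tau <- crs(y~x1+x2+z,cv="none",degree=c(5,5),segments=c(1,1),
                 lambda=0.1,kernel=TRUE,tau=0.5)
summary(model.tau)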