crs {crs}   R Documentation
Categorical Regression Splines
Description
crs computes a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and categorical (factor/ordered) predictors (Ma and Racine (2013), Ma, Racine and Yang (2015)).
Usage
crs(...)
## Default S3 method:
crs(xz,
y,
degree = NULL,
segments = NULL,
include = NULL,
kernel = TRUE,
lambda = NULL,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
tau = NULL,
weights = NULL,
...)
## S3 method for class 'formula'
crs(formula,
data = list(),
degree = NULL,
segments = NULL,
include = NULL,
degree.max = 10,
segments.max = 10,
degree.min = 0,
segments.min = 1,
cv.df.min = 1,
cv = c("nomad","exhaustive","none"),
cv.threshold = 1000,
cv.func = c("cv.ls","cv.gcv","cv.aic"),
kernel = TRUE,
lambda = NULL,
lambda.discrete = FALSE,
lambda.discrete.num = 100,
complexity = c("degree-knots","degree","knots"),
knots = c("quantiles","uniform","auto"),
basis = c("auto","additive","tensor","glp"),
deriv = 0,
data.return = FALSE,
prune = FALSE,
model.return = FALSE,
restarts = 0,
random.seed = 42,
max.bb.eval = 10000,
initial.mesh.size.real = "r1.0e-01",
initial.mesh.size.integer = "1",
min.mesh.size.real = paste(sqrt(.Machine$double.eps)),
min.mesh.size.integer = 1,
min.poll.size.real = 1,
min.poll.size.integer = 1,
    opts = list(),
nmulti = 5,
tau = NULL,
weights = NULL,
singular.ok = FALSE,
...)
Arguments
xz
numeric and/or categorical (factor/ordered) predictors.

y
a numeric vector of responses.

degree
integer/vector specifying the polynomial degree of the B-spline basis for each dimension of the continuous predictors.

segments
integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous predictors.

include
integer/vector specifying whether each of the nominal/ordinal (factor/ordered) predictors is included in or omitted from the regression.

lambda
a vector of bandwidths, one for each dimension of the categorical predictors.

lambda.discrete
if TRUE (default FALSE), the bandwidth lambda is restricted to a discrete grid of values rather than searched over continuously.

lambda.discrete.num
a positive integer (default 100) indicating the number of discrete values that lambda can assume; this parameter is only used when lambda.discrete=TRUE.
formula
a symbolic description of the model to be fit.

data
an optional data frame containing the variables in the model.

cv
a character string (default "nomad") specifying whether to use NOMAD-based numerical search ("nomad"), exhaustive search ("exhaustive"), or no cross-validation ("none").

cv.threshold
an integer (default 1000).

cv.func
a character string (default "cv.ls") specifying the cross-validation criterion: least-squares cross-validation ("cv.ls"), generalized cross-validation ("cv.gcv"; Craven and Wahba (1979)), or the AICc criterion ("cv.aic"; Hurvich, Simonoff and Tsai (1998)).

kernel
a logical value (default TRUE) specifying whether to use kernel smoothing for the categorical predictors (TRUE) or indicator basis functions (FALSE).

degree.max
the maximum degree of the B-spline basis for each of the continuous predictors (default 10).

segments.max
the maximum number of segments of the B-spline basis for each of the continuous predictors (default 10).

degree.min
the minimum degree of the B-spline basis for each of the continuous predictors (default 0).

segments.min
the minimum number of segments of the B-spline basis for each of the continuous predictors (default 1).

cv.df.min
the minimum degrees of freedom to allow when conducting NOMAD-based cross-validation (default 1).
complexity
a character string (default "degree-knots") specifying whether cross-validation selects both the spline degree and the number of knots ("degree-knots"), the degree only ("degree"), or the knots only ("knots").

knots
a character string (default "quantiles") specifying how knots are placed: at equally spaced quantiles of the data ("quantiles"), at equally spaced intervals ("uniform"), or with the placement method itself chosen by cross-validation ("auto").

basis
a character string (default "auto") specifying the basis: additive ("additive"), tensor product ("tensor"), generalized polynomial ("glp"), or selected by cross-validation ("auto").

deriv
an integer (default 0); when positive, derivatives of that order are computed for the continuous predictors (differences in levels for the categorical predictors).

data.return
a logical value (default FALSE) indicating whether to return the data along with the fitted object.

prune
a logical value (default FALSE) specifying whether to prune the fitted model via a stepwise cross-validation criterion (see Details; only available when kernel=FALSE).

model.return
a logical value (default FALSE) indicating whether to return the list of fitted lm models.
restarts
integer specifying the number of times to restart the process of finding extrema of the cross-validation function (for the bandwidths only) from different (random) initial points (default 0).

random.seed
when not missing and not equal to 0, the initial points for cv="nomad" will be generated using this seed (default 42).

max.bb.eval
argument passed to the NOMAD solver (see snomadr for further details; default 10000).

initial.mesh.size.real
argument passed to the NOMAD solver (see snomadr for further details).

initial.mesh.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

min.mesh.size.real
argument passed to the NOMAD solver (see snomadr for further details).

min.mesh.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

min.poll.size.real
argument passed to the NOMAD solver (see snomadr for further details).

min.poll.size.integer
argument passed to the NOMAD solver (see snomadr for further details).

opts
list of optional arguments passed to the NOMAD solver (see snomadr for further details).
nmulti
integer number of times to restart the process of finding extrema of the cross-validation function from different (random) initial points (default 5).

tau
if non-null, a number in (0,1) denoting the quantile for which a quantile regression spline is to be estimated rather than estimating the conditional mean (default NULL).

weights
an optional vector of weights to be used in the fitting process; should be NULL or a numeric vector. If non-NULL, weighted least squares is used with weights weights (that is, minimizing sum(w*e^2)); otherwise ordinary least squares is used.

singular.ok
a logical value (default FALSE); if FALSE, a singular basis matrix is treated as an error.

...
optional arguments.
Details
Typical usages are (see below for a list of options and also the examples at the end of this help file):

## Estimate the model and let the basis type be determined by
## cross-validation (i.e. cross-validation will determine whether to
## use the additive, generalized, or tensor product basis)

model <- crs(y~x1+x2)

## Estimate the model for a specified degree/segment/bandwidth
## combination and do not run cross-validation (will use the
## additive basis by default)

model <- crs(y~x1+factor(x2),cv="none",degree=3,segments=1,lambda=.1)

## Plot the mean and (asymptotic) error bounds

plot(model,mean=TRUE,ci=TRUE)

## Plot the first partial derivative and (asymptotic) error bounds

plot(model,deriv=1,ci=TRUE)
crs computes a regression spline estimate of a one (1) dimensional dependent variable on an r-dimensional vector of continuous and categorical (factor/ordered) predictors.

The regression spline model employs the tensor product B-spline basis matrix for a multivariate polynomial spline via the B-spline routines in the GNU Scientific Library (https://www.gnu.org/software/gsl/) and the tensor.prod.model.matrix function.
When basis="additive" the model is additive in nature (i.e. no interaction/tensor terms, hence semiparametric rather than fully nonparametric).

When basis="tensor" the model uses the multivariate tensor product basis.

When kernel=FALSE the model uses indicator basis functions for the nominal/ordinal (factor/ordered) predictors rather than kernel weighting.

When kernel=TRUE the product kernel function for the discrete predictors is of the ‘Li-Racine’ type (see Li and Racine (2007) for details).
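As an illustrative sketch of the basis options (not part of the original examples; the simulated data and variable names here are assumptions), the bases can be compared directly since a fitted crs object reports its coefficient of determination in the r.squared component:

## Sketch: compare additive and tensor product bases on data with an
## interaction between the continuous predictors
library(crs)
set.seed(123)
n <- 250
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1)*sin(2*pi*x2) + rnorm(n,sd=.25)
model.add <- crs(y~x1+x2,basis="additive")
model.ten <- crs(y~x1+x2,basis="tensor")
## The tensor product basis can capture the interaction that the
## additive basis cannot
model.add$r.squared
model.ten$r.squared

Because this data-generating process is not additive, the additive (semiparametric) fit will typically report the lower r.squared of the two.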
When cv="nomad", numerical search is undertaken using Nonsmooth Optimization by Mesh Adaptive Direct Search (Abramson, Audet, Couture, Dennis, Jr., and Le Digabel (2011)).

When kernel=TRUE and cv="exhaustive", numerical search is undertaken using optim and the box-constrained L-BFGS-B method (see optim for details). The user may restart the algorithm as many times as desired via the restarts argument (default restarts=0). The approach ascends from degree=0 (or segments=0) through degree.max and, for each value of degree (or segments), searches for the optimal bandwidths. After the most complex model has been searched, the optimal degree/segments/lambda combination is selected. If any element of the optimal degree (or segments) vector coincides with degree.max (or segments.max), a warning is produced and the user ought to restart the search with a larger value of degree.max (or segments.max).
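The degree.max advice above can also be acted on programmatically, since the selected degree vector is returned in the fitted object. A brief sketch (simulated data assumed, not part of the original examples):

## Sketch: NOMAD-based cross-validation with multiple restarts
library(crs)
set.seed(42)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1) + sin(2*pi*x2) + rnorm(n,sd=.5)
model <- crs(y~x1+x2,cv="nomad",nmulti=5)
## If any selected degree coincides with degree.max (default 10),
## restart the search with a larger ceiling as advised above
if(any(model$degree == 10))
  model <- crs(y~x1+x2,cv="nomad",nmulti=5,degree.max=15)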
Note that the default plot method for a crs object provides some diagnostic measures, in particular, a) residuals versus fitted values (useful for checking the assumption that E(u|x)=0), b) a normal quantile-quantile plot which allows residuals to be assessed for normality (qqnorm), c) a scale-location plot that is useful for checking the assumption that the errors are iid and, in particular, that the variance is homogeneous, and d) ‘Cook's distance’ which computes the single-case influence function. See below for other arguments for the plot function for a crs object.
Note that setting prune=TRUE produces a final ‘pruning’ of the model via a stepwise cross-validation criterion achieved by modifying stepAIC and replacing extractAIC with extractCV throughout the function. This option may be enabled to remove potentially superfluous bases, thereby improving the finite-sample efficiency of the resulting model. Note that if the cross-validation score for the pruned model is no better than that for the original model, then the original model is returned with a warning to this effect. Note also that this option can only be used when kernel=FALSE.
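Since pruning requires indicator bases, a minimal sketch looks as follows (simulated data assumed; not part of the original examples):

## Sketch: pruning superfluous bases (kernel=FALSE is required)
library(crs)
set.seed(42)
n <- 500
x <- runif(n)
z <- factor(rbinom(n,1,.5))
y <- cos(2*pi*x) + as.numeric(z==1)/2 + rnorm(n,sd=.5)
model <- crs(y~x+z,kernel=FALSE,prune=TRUE)
## If the pruned model's cross-validation score is no better than the
## original's, the original model is returned with a warning
summary(model)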
Value
crs returns a crs object. The generic functions fitted and residuals extract (or generate) estimated values and residuals. Furthermore, the functions summary, predict, and plot (options mean=FALSE, deriv=i where i is an integer, ci=FALSE, persp.rgl=FALSE, plot.behavior=c("plot","plot-data","data"), xtrim=0.0, xq=0.5) support objects of this type. The returned object has the following components:
fitted.values
estimates of the regression function (conditional mean) at the sample points or evaluation points.

lwr, upr
lower/upper bounds of a 95% confidence interval for the estimated conditional mean.

residuals
residuals computed at the sample points or evaluation points.

degree
integer/vector specifying the degree of the B-spline basis for each dimension of the continuous predictors.

segments
integer/vector specifying the number of segments of the B-spline basis for each dimension of the continuous predictors.

include
integer/vector specifying whether each of the nominal/ordinal (factor/ordered) predictors was included in the regression.

kernel
a logical value indicating whether kernel smoothing was used (TRUE) or indicator basis functions were used (FALSE).

lambda
vector of bandwidths used if kernel=TRUE.

call
a symbolic description of the model.

r.squared
coefficient of determination (Doksum and Samarov (1995)).

model.lm
an object of class lm holding the fitted model.

deriv.mat
a matrix of derivatives (or differences in levels for the categorical predictors).

deriv.mat.lwr
a matrix of 95% coverage lower bounds for deriv.mat.

deriv.mat.upr
a matrix of 95% coverage upper bounds for deriv.mat.

hatvalues
the hat values for the estimated model.

P.hat
the kernel probability estimates corresponding to the categorical predictors in the estimated model.
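The components above can be accessed directly from the fitted object; a brief sketch (simulated data assumed, not part of the original examples):

## Sketch: extracting components from a fitted crs object
library(crs)
set.seed(42)
n <- 500
x1 <- runif(n)
x2 <- runif(n)
y <- cos(2*pi*x1) + sin(2*pi*x2) + rnorm(n,sd=.5)
model <- crs(y~x1+x2)
head(fitted(model))        ## conditional mean estimates
head(residuals(model))     ## residuals at the sample points
model$r.squared            ## Doksum and Samarov (1995) R-squared
model$degree               ## cross-validated degree per predictor
model$segments             ## cross-validated segments per predictor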
Usage Issues
Note that when kernel=FALSE, summary supports the option sigtest=TRUE that conducts an F-test for significance for each predictor.
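A minimal sketch of the significance test (simulated data assumed; not part of the original examples):

## Sketch: predictor significance tests require kernel=FALSE
library(crs)
set.seed(42)
n <- 500
x <- runif(n)
z <- factor(rbinom(n,1,.5))
y <- cos(2*pi*x) + as.numeric(z==1)/2 + rnorm(n,sd=.5)
model <- crs(y~x+z,kernel=FALSE)
summary(model,sigtest=TRUE)  ## F-test for each predictor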
Author(s)
Jeffrey S. Racine racinej@mcmaster.ca
References
Abramson, M.A. and C. Audet and G. Couture and J.E. Dennis Jr. and S. Le Digabel (2011), “The NOMAD project”. Software available at https://www.gerad.ca/nomad.
Craven, P. and G. Wahba (1979), “Smoothing Noisy Data With Spline Functions,” Numerische Mathematik, 31, 377-403.
Doksum, K. and A. Samarov (1995), “Nonparametric Estimation of Global Functionals and a Measure of the Explanatory Power of Covariates in Regression,” The Annals of Statistics, 23, 1443-1473.
Hurvich, C.M. and J.S. Simonoff and C.L. Tsai (1998), “Smoothing Parameter Selection in Nonparametric Regression Using an Improved Akaike Information Criterion,” Journal of the Royal Statistical Society B, 60, 271-293.
Le Digabel, S. (2011), “Algorithm 909: NOMAD: Nonlinear Optimization With The MADS Algorithm”. ACM Transactions on Mathematical Software, 37(4):44:1-44:15.
Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.
Ma, S. and J.S. Racine and L. Yang (2015), “Spline Regression in the Presence of Categorical Predictors,” Journal of Applied Econometrics, Volume 30, 705-717.
Ma, S. and J.S. Racine (2013), “Additive Regression Splines with Irrelevant Categorical and Continuous Regressors,” Statistica Sinica, Volume 23, 515-541.
Racine, J.S. (2011), “Cross-Validated Quantile Regression Splines,” manuscript.
See Also
Examples
set.seed(42)
## Example - simulated data
n <- 1000
num.eval <- 50
x1 <- runif(n)
x2 <- runif(n)
z <- rbinom(n,1,.5)
dgp <- cos(2*pi*x1)+sin(2*pi*x2)+z
z <- factor(z)
y <- dgp + rnorm(n,sd=.5)
## Estimate a model with specified degree, segments, and bandwidth
model <- crs(y~x1+x2+z,degree=c(5,5),
segments=c(1,1),
lambda=0.1,
cv="none",
kernel=TRUE)
summary(model)
## Perspective plot
x1.seq <- seq(min(x1),max(x1),length=num.eval)
x2.seq <- seq(min(x2),max(x2),length=num.eval)
x.grid <- expand.grid(x1.seq,x2.seq)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(0,num.eval^2),levels=c(0,1)))
z0 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
newdata <- data.frame(x1=x.grid[,1],x2=x.grid[,2],
z=factor(rep(1,num.eval^2),levels=c(0,1)))
z1 <- matrix(predict(model,newdata=newdata),num.eval,num.eval)
zlim <- range(z0,z1)
persp(x=x1.seq,y=x2.seq,z=z0,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
ticktype="detailed",
border="red",
theta=45,phi=45)
par(new=TRUE)
persp(x=x1.seq,y=x2.seq,z=z1,
xlab="x1",ylab="x2",zlab="y",zlim=zlim,
theta=45,phi=45,
ticktype="detailed",
border="blue")
## Partial regression surface plot
plot(model,mean=TRUE,ci=TRUE)
## Not run:
## A plot example where we extract the partial surfaces, confidence
## intervals etc. automatically generated by plot(mean=TRUE,...) but do
## not plot, rather save for separate use.
pdat <- plot(model,mean=TRUE,ci=TRUE,plot.behavior="data")
## Column 1 is the (evaluation) predictor ([,1]), 2-4 ([,-1]) the mean,
## lwr, and upr (note the returned value is a 'list' hence pdat[[1]] is
## data for the first predictor, pdat[[2]] the second etc). Note that
## matplot() can plot this nicely.
matplot(pdat[[1]][,1],pdat[[1]][,-1],
xlab=names(pdat[[1]][1]),ylab=names(pdat[[1]][2]),
lty=c(1,2,2),col=c(1,2,2),type="l")
## End(Not run)
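The tau argument described above can be used to estimate conditional quantiles rather than the conditional mean; a brief sketch reusing y, x1, x2, and z from the simulated-data example (the specification mirrors the earlier cv="none" fit, which is an assumption made for speed):

## Quantile regression spline for the conditional median (tau=0.5)
model.tau <- crs(y~x1+x2+z,cv="none",degree=c(5,5),segments=c(1,1),
                 lambda=0.1,kernel=TRUE,tau=0.5)
summary(model.tau)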