R: Fit a Smoothing Spline

smooth.spline {stats}

R Documentation

Fit a Smoothing Spline

Description

Fits a cubic smoothing spline to the supplied data.

Usage

smooth.spline(x, y = NULL, w = NULL, df, spar = NULL, lambda = NULL, cv = FALSE,
              all.knots = FALSE, nknots = .nknots.smspl,
              keep.data = TRUE, df.offset = 0, penalty = 1,
              control.spar = list(), tol = 1e-6 * IQR(x), keep.stuff = FALSE)

.nknots.smspl(n)

Arguments

`x`	a vector giving the values of the predictor variable, or a list or a two-column matrix specifying x and y.
`y`	responses. If `y` is missing or `NULL`, the responses are assumed to be specified by `x`, with `x` the index vector.
`w`	optional vector of weights of the same length as `x`; defaults to all 1.
`df`	the desired equivalent number of degrees of freedom (trace of the smoother matrix). Must be in `(1,n_x]`, `n_x` the number of unique x values, see below.
`spar`	smoothing parameter, typically (but not necessarily) in `(0,1]`. When `spar` is specified, the coefficient `\lambda` of the integral of the squared second derivative in the fit (penalized log likelihood) criterion is a monotone function of `spar`, see the details below. Alternatively `lambda` may be specified instead of the scale free `spar`=`s`.
`lambda`	if desired, the internal (design-dependent) smoothing parameter `\lambda` can be specified instead of `spar`. This may be desirable for resampling algorithms such as cross validation or the bootstrap.
`cv`	ordinary leave-one-out (`TRUE`) or ‘generalized’ cross-validation (GCV) when `FALSE`; is used for smoothing parameter computation only when both `spar` and `df` are not specified; it is used however to determine `cv.crit` in the result. Setting it to `NA` for speedup skips the evaluation of leverages and any score.
`all.knots`	if `TRUE`, all distinct points in `x` are used as knots. If `FALSE` (default), a subset of `x[]` is used, specifically `x[j]` where the `nknots` indices are evenly spaced in `1:n`, see also the next argument `nknots`. Alternatively, a strictly increasing `numeric` vector specifying “all the knots” to be used; must be rescaled to `[0, 1]` already such that it corresponds to the `ans $ fit$knots` sequence returned, not repeating the boundary knots.
`nknots`	integer or `function` giving the number of knots to use when `all.knots = FALSE`. If a function (as by default), the number of knots is `nknots(nx)`. By default using `.nknots.smspl()`, for `n_x > 49` this is less than `n_x`, the number of unique `x` values, see the Note.
`keep.data`	logical specifying if the input data should be kept in the result. If `TRUE` (as per default), fitted values and residuals are available from the result.
`df.offset`	allows the degrees of freedom to be increased by `df.offset` in the GCV criterion.
`penalty`	the coefficient of the penalty for degrees of freedom in the GCV criterion.
`control.spar`	optional list with named components controlling the root finding when the smoothing parameter `spar` is computed, i.e., missing or `NULL`, see below. Note that this is partly experimental and may change with general spar computation improvements! low: lower bound for `spar`; defaults to -1.5 (used to implicitly default to 0 in R versions earlier than 1.4). high: upper bound for `spar`; defaults to +1.5. tol: the absolute precision (tolerance) used; defaults to 1e-4 (formerly 1e-3). eps: the relative precision used; defaults to 2e-8 (formerly 0.00244). trace: logical indicating if iterations should be traced. maxit: integer giving the maximal number of iterations; defaults to 500. Note that `spar` is only searched for in the interval `[low, high]`.
`tol`	a tolerance for sameness or uniqueness of the `x` values. The values are binned into bins of size `tol` and values which fall into the same bin are regarded as the same. Must be strictly positive (and finite).
`keep.stuff`	an experimental `logical` indicating if the result should keep extras from the internal computations. Should allow to reconstruct the `X` matrix and more.
`n`	for `.nknots.smspl`; typically the number of unique `x` values (aka `n_x`).

Details

Neither x nor y are allowed to containing missing or infinite values.

The x vector should contain at least four distinct values. ‘Distinct’ here is controlled by tol: values which are regarded as the same are replaced by the first of their values and the corresponding y and w are pooled accordingly.

Unless lambda has been specified instead of spar, the computational \lambda used (as a function of s=spar) is \lambda = r \cdot 256^{3 s - 1} where r = tr(X' W X) / tr(\Sigma), \Sigma is the matrix given by \Sigma_{ij} = \int B_i''(t) B_j''(t) dt, X is given by X_{ij} = B_j(x_i), W is the diagonal matrix of weights (scaled such that its trace is n, the original number of observations) and B_k(.) is the k-th B-spline.

Note that with these definitions, f_i = f(x_i), and the B-spline basis representation f = X c (i.e., c is the vector of spline coefficients), the penalized log likelihood is L = (y - f)' W (y - f) + \lambda c' \Sigma c, and hence c is the solution of the (ridge regression) (X' W X + \lambda \Sigma) c = X' W y.

If spar and lambda are missing or NULL, the value of df is used to determine the degree of smoothing. If df is missing as well, leave-one-out cross-validation (ordinary or ‘generalized’ as determined by cv) is used to determine \lambda.

Note that from the above relation, spar is s = s0 + 0.0601 \cdot \log\lambda.

Note however that currently the results may become very unreliable for spar values smaller than about -1 or -2. The same may happen for values larger than 2 or so. Don't think of setting spar or the controls low and high outside such a safe range, unless you know what you are doing! Similarly, specifying lambda instead of spar is delicate, notably as the range of “safe” values for lambda is not scale-invariant and hence entirely data dependent.

The ‘generalized’ cross-validation method GCV will work correctly when there are duplicated points in x. However, it is ambiguous what leave-one-out cross-validation means with duplicated points, and the internal code uses an approximation that involves leaving out groups of duplicated points. cv = TRUE is best avoided in that case.

Value

An object of class "smooth.spline" with components

`x`	the distinct `x` values in increasing order, see the ‘Details’ above.
`y`	the fitted values corresponding to `x`.
`w`	the weights used at the unique values of `x`.
`yin`	the y values used at the unique `y` values.
`tol`	the `tol` argument (whose default depends on `x`).
`data`	only if `keep.data = TRUE`: itself a `list` with components `x`, `y` and `w` of the same length. These are the original `(x_i,y_i,w_i), i = 1, \dots, n`, values where `data$x` may have repeated values and hence be longer than the above `x` component; see details.
`n`	an integer; the (original) sample size.
`lev`	(when `cv` was not `NA`) leverages, the diagonal values of the smoother matrix.
`cv`	the `cv` argument used; i.e., `FALSE`, `TRUE`, or `NA`.
`cv.crit`	cross-validation score, ‘generalized’ or true, depending on `cv`. The CV score is often called “PRESS” (and labeled on `print()`), for ‘PREdiction Sum of Squares’. Note that this is not the same as the (CV or GCV) score which is minimized during fitting (and returned in `crit`), e.g., in the case of `nx < n` (where `nx=n_x` is the number of unique x values).
`pen.crit`	the penalized criterion, a non-negative number; simply the (weighted) residual sum of squares (RSS), `sum(.$w * residuals(.)^2)` .
`crit`	the criterion value minimized in the underlying `.Fortran` routine ‘sslvrg’. When `df` has been specified, the criterion is `3 + (tr(S_\lambda) - df)^2`, where the `3 +` is there for numerical (and historical) reasons.
`df`	equivalent degrees of freedom used. Note that (currently) this value may become quite imprecise when the true `df` is between and 1 and 2.
`spar`	the value of `spar` computed or given, unless it has been given as `c(lambda = *)`, when it set to `NA` here.
`ratio`	(when `spar` above is not `NA`), the value `r`, the ratio of two matrix traces.
`lambda`	the value of `\lambda` corresponding to `spar`, see the details above.
`iparms`	named integer(3) vector where `..$ipars["iter"]` gives number of spar computing iterations used.
`auxMat`	experimental; when `keep.stuff` was true, a “flat” numeric vector containing parts of the internal computations.
`fit`	list for use by `predict.smooth.spline`, with components knot: the knot sequence (including the repeated boundary knots), scaled into `[0, 1]` (via `min` and `range`). nk: number of coefficients or number of ‘proper’ knots plus 2. coef: coefficients for the spline basis used. min, range: numbers giving the corresponding quantities of `x`.
`call`	the matched call.

method(class = "smooth.spline") shows a hatvalues() method based on the lev vector above.

Note

The number of unique x values, \code{nx} = n_x, are determined by the tol argument, equivalently to

    nx <- length(x) - sum(duplicated( round((x - mean(x)) / tol) ))

The default all.knots = FALSE and nknots = .nknots.smspl, entails using only O({n_x}^{0.2}) knots instead of n_x for n_x > 49. This cuts speed and memory requirements, but not drastically anymore since R version 1.5.1 where it is only O(n_k) + O(n) where n_k is the number of knots.

In this case where not all unique x values are used as knots, the result is a regression spline rather than a smoothing spline in the strict sense, but very close unless a small smoothing parameter (or large df) is used.

Author(s)

R implementation by B. D. Ripley and Martin Maechler (spar/lambda, etc).

Source

This function is based on code in the GAMFIT Fortran program by T. Hastie and R. Tibshirani (originally taken from http://lib.stat.cmu.edu/general/gamfit) which makes use of spline code by Finbarr O'Sullivan. Its design parallels the smooth.spline function of Chambers & Hastie (1992).

References

Chambers, J. M. and Hastie, T. J. (1992) Statistical Models in S, Wadsworth & Brooks/Cole.

Green, P. J. and Silverman, B. W. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall.

Hastie, T. J. and Tibshirani, R. J. (1990) Generalized Additive Models. Chapman and Hall.

Examples

require(graphics)
plot(dist ~ speed, data = cars, main = "data(cars)  &  smoothing splines")
cars.spl <- with(cars, smooth.spline(speed, dist))
cars.spl
## This example has duplicate points, so avoid cv = TRUE

lines(cars.spl, col = "blue")
ss10 <- smooth.spline(cars[,"speed"], cars[,"dist"], df = 10)
lines(ss10, lty = 2, col = "red")
legend(5,120,c(paste("default [C.V.] => df =",round(cars.spl$df,1)),
               "s( * , df = 10)"), col = c("blue","red"), lty = 1:2,
       bg = 'bisque')


## Residual (Tukey Anscombe) plot:
plot(residuals(cars.spl) ~ fitted(cars.spl))
abline(h = 0, col = "gray")

## consistency check:
stopifnot(all.equal(cars$dist,
                    fitted(cars.spl) + residuals(cars.spl)))
## The chosen inner knots in original x-scale :
with(cars.spl$fit, min + range * knot[-c(1:3, nk+1 +1:3)]) # == unique(cars$speed)

## Visualize the behavior of  .nknots.smspl()
nKnots <- Vectorize(.nknots.smspl) ; c.. <- adjustcolor("gray20",.5)
curve(nKnots, 1, 250, n=250)
abline(0,1, lty=2, col=c..); text(90,90,"y = x", col=c.., adj=-.25)
abline(h=100,lty=2); abline(v=200, lty=2)

n <- c(1:799, seq(800, 3490, by=10), seq(3500, 10000, by = 50))
plot(n, nKnots(n), type="l", main = "Vectorize(.nknots.smspl) (n)")
abline(0,1, lty=2, col=c..); text(180,180,"y = x", col=c..)
n0 <- c(50, 200, 800, 3200); c0 <- adjustcolor("blue3", .5)
lines(n0, nKnots(n0), type="h", col=c0)
axis(1, at=n0, line=-2, col.ticks=c0, col=NA, col.axis=c0)
axis(4, at=.nknots.smspl(10000), line=-.5, col=c..,col.axis=c.., las=1)

##-- artificial example
y18 <- c(1:3, 5, 4, 7:3, 2*(2:5), rep(10, 4))
xx  <- seq(1, length(y18), length.out = 201)
(s2   <- smooth.spline(y18)) # GCV
(s02  <- smooth.spline(y18, spar = 0.2))
(s02. <- smooth.spline(y18, spar = 0.2, cv = NA))
plot(y18, main = deparse(s2$call), col.main = 2)
lines(s2, col = "gray"); lines(predict(s2, xx), col = 2)
lines(predict(s02, xx), col = 3); mtext(deparse(s02$call), col = 3)

## Specifying 'lambda' instead of usual spar :
(s2. <- smooth.spline(y18, lambda = s2$lambda, tol = s2$tol))



## The following shows the problematic behavior of 'spar' searching:
(s2  <- smooth.spline(y18, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
(s2m <- smooth.spline(y18, cv = TRUE, control =
                      list(trace = TRUE, tol = 1e-6, low = -1.5)))
## both above do quite similarly (Df = 8.5 +- 0.2)

[Package stats version 4.4.1 Index]