bigspline {bigsplines}R Documentation

Fits Smoothing Spline

Description

Given a real-valued response vector \mathbf{y}=\{y_{i}\}_{n\times1} and a real-valued predictor vector \mathbf{x}=\{x_{i}\}_{n\times 1} with a \leq x_{i} \leq b \ \forall i, a smoothing spline model has the form

y_{i}=\eta(x_{i})+e_{i}

where y_{i} is the i-th observation's respone, x_{i} is the i-th observation's predictor, \eta is an unknown smooth function relating the response and predictor, and e_{i}\sim\mathrm{N}(0,\sigma^{2}) is iid Gaussian error.

Usage

bigspline(x,y,type="cub",nknots=30,rparm=0.01,xmin=min(x),
          xmax=max(x),alpha=1,lambdas=NULL,se.fit=FALSE,
          rseed=1234,knotcheck=TRUE)

Arguments

x

Predictor vector.

y

Response vector. Must be same length as x.

type

Type of spline for x. Options include type="lin" for linear, type="cub" for cubic, type="cub0" for different cubic, and type="per" for cubic periodic. See Spline Types section.

nknots

Scalar giving maximum number of knots to bin-sample. Use more knots for more jagged functions.

rparm

Rounding parameter for x. Use rparm=NA to fit unrounded solution. Rounding parameter must be in interval (0,1].

xmin

Minimum x value (i.e., a). Used to transform data to interval [0,1].

xmax

Maximum x value (i.e., b). Used to transform data to interval [0,1].

alpha

Manual tuning parameter for GCV score. Using alpha=1 gives unbaised esitmate. Using a larger alpha enforces a smoother estimate.

lambdas

Vector of global smoothing parameters to try. Default estimates smoothing parameter that minimizes GCV score.

se.fit

Logical indicating if the standard errors of fitted values should be estimated.

rseed

Random seed. Input to set.seed to reproduce same knots when refitting same model. Use rseed=NULL to generate a different sample of knots each time.

knotcheck

If TRUE, only unique knots are used (for stability).

Details

To estimate \eta I minimize the penalized least-squares functional

\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\eta(x_{i}))^{2}+\lambda \int [\ddot{\eta}(x)]^2 dx

where \ddot{\eta} denotes the second derivative of \eta and \lambda\geq0 is a smoothing parameter that controls the trade-off between fitting and smoothing the data.

Default use of the function estimates \lambda by minimizing the GCV score:

\mbox{GCV}(\lambda) = \frac{n\|(\mathbf{I}_{n}-\mathbf{S}_{\lambda})\mathbf{y}\|^{2}}{[n-\mathrm{tr}(\mathbf{S}_{\lambda})]^2}

where \mathbf{I}_{n} is the identity matrix and \mathbf{S}_{\lambda} is the smoothing matrix (see Computational Details).

Using the rounding parameter input rparm can greatly speed-up and stabilize the fitting for large samples. When rparm is used, the spline is fit to a set of unique data points after rounding; the unique points are determined using the efficient algorithm described in Helwig (2013). For typical cases, I recommend using rparm=0.01, but smaller rounding parameters (e,g., rparm=0.001) may be needed for particularly jagged functions (or when x has outliers).

Value

fitted.values

Vector of fitted values corresponding to the original data points in x (if rparm=NA) or the rounded data points in xunique (if rparm is used).

se.fit

Vector of standard errors of fitted.values (if input se.fit=TRUE).

x

Predictor vector (same as input).

y

Response vector (same as input).

type

Type of spline that was used.

xunique

Unique elements of x after rounding (if rparm is used).

yunique

Mean of y for unique elements of x after rounding (if rparm is used).

funique

Vector giving frequency of each element of xunique (if rparm is used).

sigma

Estimated error standard deviation, i.e., \hat{\sigma}.

ndf

Data frame with two elements: n is total sample size, and df is effective degrees of freedom of fit model (trace of smoothing matrix).

info

Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of fit model (assuming Gaussian error).

xrng

Predictor range: xrng=c(xmin,xmax).

myknots

Bin-sampled spline knots used for fit.

rparm

Rounding parameter for x (same as input).

lambda

Optimal smoothing parameter.

coef

Spline basis function coefficients.

coef.csqrt

Matrix square-root of covariace matrix of coef. Use tcrossprod(coef.csqrt) to get covariance matrix of coef.

Warnings

Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting. So input xmin must be less than or equal to min(x), and input xmax must be greater than or equal to max(x).

When using rounding parameters, output fitted.values corresponds to unique rounded predictor scores in output xunique. Use predict.bigspline function to get fitted values for full y vector.

Computational Details

According to smoothing spline theory, the function \eta can be approximated as

\eta(x) = d_{0} + d_{1}\phi_{1}(x) + \sum_{h=1}^{q}c_{h}\rho(x,x_{h}^{*})

where the \phi_{1} is a linear function, \rho is the reproducing kernel of the contrast (nonlinear) space, and \{x_{h}^{*}\}_{h=1}^{q} are the selected spline knots.

This implies that the penalized least-squares functional can be rewritten as

\|\mathbf{y} - \mathbf{K}\mathbf{d} - \mathbf{J}\mathbf{c}\|^{2} + n\lambda\mathbf{c}'\mathbf{Q}\mathbf{c}

where \mathbf{K}=\{\phi(x_{i})\}_{n \times 2} is the null space basis function matrix, \mathbf{J}=\{\rho(x_{i},x_{h}^{*})\}_{n \times q} is the contrast space basis funciton matrix, \mathbf{Q}=\{\rho(x_{g}^{*},x_{h}^{*})\}_{q \times q} is the penalty matrix, and \mathbf{d}=(d_{0},d_{1})' and \mathbf{c}=(c_{1},\ldots,c_{q})' are the unknown basis function coefficients.

Given the smoothing parameter \lambda, the optimal basis function coefficients have the form

\left(\begin{array}{cc} \hat{\mathbf{d}} \\ \hat{\mathbf{c}} \end{array}\right) = \left(\begin{array}{cc} \mathbf{K'K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + n\lambda\mathbf{Q} \end{array}\right)^{\dagger} \left(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)\mathbf{y}

where (\cdot)^{\dagger} denotes the pseudoinverse of the input matrix.

Given the optimal coefficients, the fitted values are given by \hat{\mathbf{y}} = \mathbf{K}\hat{\mathbf{d}}+\mathbf{J}\hat{\mathbf{c}} = \mathbf{S}_{\lambda}\mathbf{y}, where

\mathbf{S}_{\lambda} = \left(\begin{array}{cc} \mathbf{K} & \mathbf{J} \end{array}\right) \left(\begin{array}{cc} \mathbf{K'K} & \mathbf{K}'\mathbf{J} \\ \mathbf{J}'\mathbf{K} & \mathbf{J}'\mathbf{J} + n\lambda\mathbf{Q} \end{array}\right)^{\dagger} \left(\begin{array}{c} \mathbf{K}' \\ \mathbf{J}' \end{array}\right)

is the smoothing matrix, which depends on \lambda.

Spline Types

For a linear spline (type="lin") with x \in [0,1], the needed functions are

\phi_{1}(x) = 0 \qquad \mbox{and} \qquad \rho(x,z) = k_{1}(x)k_{1}(z)+k_{2}(|x-z|)

where k_{1}(x)=x-0.5, k_{2}(x)=\frac{1}{2}\left(k_{1}^{2}(x) - \frac{1}{12} \right); in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.

For a cubic spline (type="cub") with x \in [0,1], the needed functions are

\phi_{1}(x) = k_{1}(x) \qquad \mbox{and} \qquad \rho(x,z) = k_{2}(x)k_{2}(z)-k_{4}(|x-z|)

where k_{1} and k_{2} are defined above, and k_{4}(x)=\frac{1}{24}\left(k_{1}^{4}(x) - \frac{k_{1}^{2}(x)}{2} + \frac{7}{240} \right).

For a different cubic spline (type="cub0") with x \in [0,1], the needed functions are

\phi_{1}(x) = x \qquad \mbox{and} \qquad \rho(x,z) = (x \wedge z)^2[3(x \vee z) - (x \wedge z)]/6

where (x \wedge z) = \min(x,z) and (x \vee z) = \max(x,z).

Note that type="cub" and type="cub0" use different definitions of the averaging operator in the null space. The overall spline estimates should be the same (up to approximation accuracy), but the null and constrast space effect functions will differ (see predict.bigspline). See Helwig (2013) and Gu (2013) for a further discussion of polynomial splines.

For a periodic cubic spline (type="per") with x \in [0,1], the needed functions are

\phi_{1}(x) = 0 \qquad \mbox{and} \qquad \rho(x,z) = -k_{4}(|x-z|)

where k_{4}(x) is defined as it was for type="cub"; in this case \mathbf{K}=\mathbf{1}_{n} and \mathbf{d}=d_{0}.

Note

The spline is estimated using penalized least-squares, which does not require the Gaussian error assumption. However, the spline inference information (e.g., standard errors and fit information) requires the Gaussian error assumption.

Author(s)

Nathaniel E. Helwig <helwig@umn.edu>

References

Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer.

Helwig, N. E. (2013). Fast and stable smoothing spline analysis of variance models for large samples with applications to electroencephalography data analysis. Unpublished doctoral dissertation. University of Illinois at Urbana-Champaign.

Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics, 24, 715-732.

Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, 9, 433-444.

Examples


##########   EXAMPLE 1   ##########

# define relatively smooth function
set.seed(773)
myfun <- function(x){ sin(2*pi*x) }
x <- runif(10^6)
y <- myfun(x) + rnorm(10^6)

# linear, cubic, different cubic, and periodic splines
linmod <- bigspline(x,y,type="lin")
linmod
cubmod <- bigspline(x,y)
cubmod
cub0mod <- bigspline(x,y,type="cub0")
cub0mod
permod <- bigspline(x,y,type="per")
permod


##########   EXAMPLE 2   ##########

# define more jagged function
set.seed(773)
myfun <- function(x){ 2*x + cos(4*pi*x) }
x <- runif(10^6)*4
y <- myfun(x) + rnorm(10^6)

# try different numbers of knots
r1mod <- bigspline(x,y,nknots=20)
crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted)
r2mod <- bigspline(x,y,nknots=30)
crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted)
r3mod <- bigspline(x,y,nknots=40)
crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted)


##########   EXAMPLE 3   ##########

# define more jagged function
set.seed(773)
myfun <- function(x){ 2*x + cos(4*pi*x) }
x <- runif(10^6)*4
y <- myfun(x) + rnorm(10^6)

# try different rounding parameters
r1mod <- bigspline(x,y,rparm=0.05)
crossprod( myfun(r1mod$xunique) - r1mod$fitted )/length(r1mod$fitted)
r2mod <- bigspline(x,y,rparm=0.02)
crossprod( myfun(r2mod$xunique) - r2mod$fitted )/length(r2mod$fitted)
r3mod <- bigspline(x,y,rparm=0.01)
crossprod( myfun(r3mod$xunique) - r3mod$fitted )/length(r3mod$fitted)


[Package bigsplines version 1.1-1 Index]