bigssg {bigsplines}   R Documentation
Fits Generalized Smoothing Spline ANOVA Models
Description
Given an exponential family response vector \mathbf{y}=\{y_{i}\}_{n\times1}
, a Generalized Smoothing Spline ANOVA (GSSA) model has the form
g(\mu_{i}) = \eta(\mathbf{x}_{i})
where \mu_{i}
is the expected value of the i
-th observation's response, g(\cdot)
is some invertible link function, \mathbf{x}_{i}=(x_{i1},\ldots,x_{ip})
is the i
-th observation's nonparametric predictor vector, and \eta
is an unknown smooth function relating the response and the nonparametric predictors. The function can fit additive models, and it also allows for 2-way and 3-way interactions between any of the predictors. The response can follow one of five non-Gaussian distributions: Binomial, Poisson, Gamma, Inverse Gaussian, or Negative Binomial (see Details and Examples).
Usage
bigssg(formula,family,data=NULL,type=NULL,nknots=NULL,rparm=NA,
lambdas=NULL,skip.iter=TRUE,se.lp=FALSE,rseed=1234,
gcvopts=NULL,knotcheck=TRUE,gammas=NULL,weights=NULL,
gcvtype=c("acv","gacv","gacv.old"))
Arguments
formula
An object of class "formula": a symbolic description of the model to be fitted (see Details and Examples).
family
Distribution for the response. One of five options: "binomial", "poisson", "Gamma", "inverse.gaussian", or "negbin" (see Response section).
data
Optional data frame, list, or environment containing the variables in formula.
type
List of smoothing spline types for the predictors in formula (see Details).
nknots
Two possible options: (a) scalar giving the total number of random knots to sample, or (b) vector indexing which rows of the data to use as knots.
rparm
List of rounding parameters for each predictor. See Details.
lambdas
Vector of global smoothing parameters to try. Default (lambdas=NULL) uses a default grid of candidate values.
skip.iter
Logical indicating whether to skip the iterative smoothing parameter update. Using skip.iter=FALSE can produce a (slightly) better fit at additional computational cost (see Skip Iteration section).
se.lp
Logical indicating if the standard errors of the linear predictors (\eta) should be estimated. Default is se.lp=FALSE.
rseed
Random seed for knot sampling. Input is ignored if nknots is a vector of knot indices.
gcvopts
Control parameters for the optimization. List with 6 elements controlling the maximum numbers of iterations and the convergence tolerances of the estimation algorithm.
knotcheck
If TRUE, only unique knots are used (for stability).
gammas
List of initial smoothing parameters for each predictor. See Details.
weights
Vector of positive weights for fitting (default is vector of ones).
gcvtype
Cross-validation criterion for selecting smoothing parameters (see Details).
Details
The formula
syntax is similar to that used in lm
and many other R regression functions. Use y~x
to predict the response y
from the predictor x
. Use y~x1+x2
to fit an additive model of the predictors x1
and x2
, and use y~x1*x2
to fit an interaction model. The syntax y~x1*x2
includes both the interaction and the main effects; the interaction-only syntax y~x1:x2
is not supported. See Computational Details for specifics about how the nonparametric effects are estimated.
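For concreteness, the calls below sketch the supported formula patterns (the response y, predictors x1 and x2, and data frame mydata are placeholders for illustration only):
# additive model: main effects of x1 and x2 only
# gmod.add <- bigssg(y~x1+x2,family="poisson",data=mydata,nknots=20)
# interaction model: x1 and x2 main effects plus their interaction
# gmod.int <- bigssg(y~x1*x2,family="poisson",data=mydata,nknots=20)
# interaction-only syntax y~x1:x2 is not supported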
See bigspline
for definitions of type="cub"
, type="cub0"
, and type="per"
splines, which can handle one-dimensional predictors. See Appendix of Helwig and Ma (2015) for information about type="tps"
and type="nom"
splines. Note that type="tps"
can handle one-, two-, or three-dimensional predictors. I recommend using type="cub"
if the predictor scores have no extreme outliers; when outliers are present, type="tps"
may produce a better result.
Using the rounding parameter input rparm
can greatly speed up and stabilize the fitting for large samples. For typical cases, I recommend using rparm=0.01
for cubic and periodic splines, but smaller rounding parameters may be needed for particularly jagged functions. For thin-plate splines, the data are NOT transformed to the interval [0,1] before fitting, so the rounding parameter should be on the raw data scale. Also, for type="tps"
you can enter one rounding parameter for each predictor dimension. Use rparm=1
for ordinal and nominal splines.
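As an illustrative sketch (the predictor names are hypothetical), a rounding parameter can be supplied for each predictor via a list, with the value chosen to match the spline type:
# cubic splines: predictors are internally scaled to [0,1], so rparm=0.01 rounds to the nearest 0.01
# gmod <- bigssg(y~x1*x2,family="poisson",type=list(x1="cub",x2="cub"),nknots=50,rparm=list(x1=0.01,x2=0.01))
# nominal predictor z: use a rounding parameter of 1
# gmod <- bigssg(y~x1*z,family="poisson",type=list(x1="cub",z="nom"),nknots=50,rparm=list(x1=0.01,z=1))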
Value
fitted.values
Vector of fitted values (data scale) corresponding to the original data points in yvar (if rparm=NA) or to the unique rounded predictor scores (if rounding parameters are used; see Warnings).
linear.predictors
Vector of fitted values (link scale) corresponding to the original data points in yvar (if rparm=NA) or to the unique rounded predictor scores (if rounding parameters are used; see Warnings).
se.lp
Vector of standard errors of the linear predictors (only if input se.lp=TRUE).
yvar
Response vector.
xvars
List of predictors.
type
Type of smoothing spline that was used for each predictor.
yunique
Mean of yvar for each unique row of the rounded predictors (only if rounding parameters are used).
xunique
Unique rows of the rounded predictors (only if rounding parameters are used).
dispersion
Estimated dispersion parameter (see Response section).
ndf
Data frame with two elements: the sample size (n) and the effective degrees of freedom (df) of the fit model.
info
Model fit information: vector containing the GCV, multiple R-squared, AIC, and BIC of the fit model.
modelspec
List containing specifics of the fit model (needed for prediction).
converged
Convergence status: converged=TRUE if the iterative estimation algorithm converged.
tnames
Names of the terms in the model.
family
Distribution family (same as input).
call
Called model in input formula.
Warnings
Cubic and cubic periodic splines transform the predictor to the interval [0,1] before fitting.
When using rounding parameters, the output fitted.values corresponds to the unique rounded predictor scores in the output xunique. Use the predict.bigssg function to get fitted values for the full yvar vector.
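For example, a call along the following lines (a sketch, not run) recovers predictions for every original observation after fitting with a rounding parameter:
# gmod <- bigssg(y~x,family="poisson",type="cub",nknots=20,rparm=0.01)
# length(gmod$fitted.values) ## one value per unique rounded predictor score
# pred <- predict(gmod,newdata=data.frame(x=x)) ## predictions for all n original observations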
Response
Only one link is permitted for each family:
family="binomial"
Logit link. Response should be a vector of proportions in the interval [0,1]. If the response is a sample proportion, the total count should be input through the weights argument.
family="poisson"
Log link. Response should be vector of counts (non-negative integers).
family="Gamma"
Inverse link. Response should be vector of positive real-valued data. Estimated dispersion
parameter is the inverse of the shape
parameter, so that the variance of the response increases as dispersion
increases.
family="inverse.gaussian"
Inverse-square link. Response should be vector of positive real-valued data. Estimated dispersion
parameter is the inverse of the shape
parameter, so that the variance of the response increases as dispersion
increases.
family="negbin"
Log link. Response should be vector of counts (non-negative integers). Estimated dispersion
parameter is the inverse of the size
parameter, so that the variance of the response increases as dispersion
increases.
family=list("negbin",2)
Log link. Response should be vector of counts (non-negative integers). Second element is the known (common) dispersion
parameter (2 in this case). The input dispersion
parameter should be the inverse of the size
parameter, so that the variance of the response increases as dispersion
increases.
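As a quick numerical check of the negative binomial parameterization (a sketch using arbitrary values of mu and size), the variance of the response grows with the dispersion 1/size:
# mu <- 2; size <- 0.5 ## dispersion = 1/size = 2
# mu + (1/size)*mu^2 ## theoretical variance: mu + mu^2/size = 10
# var(rnbinom(n=1e5,size=size,mu=mu)) ## simulated variance (approximately 10)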
Computational Details
To estimate \eta
I minimize the penalized negative log likelihood
-\frac{1}{n}\sum_{i=1}^{n}\left\{y_{i}\eta(\mathbf{x}_{i}) - b(\eta(\mathbf{x}_{i})) \right\} + \frac{\lambda}{2} J(\eta)
where J(\cdot)
is a nonnegative penalty functional quantifying the roughness of \eta
and \lambda>0
is a smoothing parameter controlling the trade-off between fitting and smoothing the data. Note that for p>1
nonparametric predictors, there are additional \theta_{k}
smoothing parameters embedded in J
.
Following standard exponential family theory, \mu_{i} = \dot{b}(\eta(\mathbf{x}_{i}))
and v_{i} = \ddot{b}(\eta(\mathbf{x}_{i}))a(\xi)
, where \dot{b}(\cdot)
and \ddot{b}(\cdot)
denote the first and second derivatives of b(\cdot)
, v_{i}
is the variance of y_{i}
, and \xi
is the dispersion parameter. Given fixed smoothing parameters, the optimal \eta
can be estimated by iteratively minimizing the penalized reweighted least-squares functional
\frac{1}{n}\sum_{i=1}^{n}v_{i}^{*}\left(y_{i}^{*} - \eta(\mathbf{x}_{i}) \right)^{2} + \lambda J(\eta)
where v_{i}^{*}=v_{i}/a(\xi)
is the weight, y_{i}^{*}=\hat{\eta}(\mathbf{x}_{i})+(y_{i}-\hat{\mu}_{i})/v_{i}^{*}
is the adjusted dependent variable, and \hat{\eta}(\mathbf{x}_{i})
is the current estimate of \eta
.
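A minimal sketch of one reweighted update for the binomial (logit) family, assuming a current linear-predictor estimate etahat (the object names are illustrative, not the package's internal code):
# muhat <- 1/(1+exp(-etahat)) ## mean: first derivative of b(eta)
# vstar <- muhat*(1-muhat) ## weight: second derivative of b(eta), with a(xi)=1
# ystar <- etahat + (y-muhat)/vstar ## adjusted dependent variable
# then refit a weighted smoothing spline to (x,ystar) using weights vstar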
The optimal smoothing parameters are chosen via direct cross-validation (see Gu & Xiang, 2001).
Setting gcvtype="acv"
uses the Approximate Cross-Validation (ACV) score:
-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}\hat{\eta}(\mathbf{x}_{i}) - b(\hat{\eta}(\mathbf{x}_{i}))\} + \frac{1}{n}\sum_{i=1}^{n}\frac{s_{ii}}{(1-s_{ii})v_{i}^{*}}y_{i}(y_{i}-\hat{\mu}_{i})
where s_{ii}
is the i-th diagonal of the smoothing matrix \mathbf{S}_{\boldsymbol\lambda}
.
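For intuition, the ACV score for a Poisson (log link) fit could be assembled as follows from the fitted values and the diagonal of the smoothing matrix (eta.hat, mu.hat, and s.ii are assumed available; this is an illustrative sketch, not the package's internal code):
# vstar <- mu.hat ## b''(eta) for Poisson, with a(xi)=1
# loglik.part <- -mean(y*eta.hat - exp(eta.hat)) ## -(1/n) sum{ y*eta - b(eta) }
# acv <- loglik.part + mean(s.ii/((1-s.ii)*vstar)*y*(y-mu.hat))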
Setting gcvtype="gacv"
uses the Generalized ACV (GACV) score:
-\frac{1}{n}\sum_{i=1}^{n}\{y_{i}\hat{\eta}(\mathbf{x}_{i}) - b(\hat{\eta}(\mathbf{x}_{i}))\} + \frac{\mathrm{tr}(\mathbf{S}_{\boldsymbol\lambda}\mathbf{V}^{-1})}{n-\mathrm{tr}(\mathbf{S}_{\boldsymbol\lambda})}\frac{1}{n}\sum_{i=1}^{n}y_{i}(y_{i}-\hat{\mu}_{i})
where \mathbf{S}_{\boldsymbol\lambda}
is the smoothing matrix, and \mathbf{V}=\mathrm{diag}(v_{1}^{*},\ldots,v_{n}^{*})
.
Setting gcvtype="gacv.old"
uses an approximation of the GACV where \frac{1}{n}\mathrm{tr}(\mathbf{S}_{\boldsymbol\lambda}\mathbf{V}^{-1})
is approximated using \frac{1}{n^2}\mathrm{tr}(\mathbf{S}_{\boldsymbol\lambda})\mathrm{tr}(\mathbf{V}^{-1})
. This option is included for backward compatibility (version 1.0-4 and earlier) and is not recommended, because the ACV or GACV often perform better.
Note that this function uses the efficient SSA reparameterization described in Helwig (2013) and Helwig and Ma (2015); using this parameterization, there is one unique smoothing parameter per predictor (\gamma_{j}
), and these \gamma_{j}
parameters determine the structure of the \theta_{k}
parameters in the tensor product space. To evaluate the ACV/GACV score, this function uses the improved (scalable) GSSA algorithm discussed in Helwig (in preparation).
Skip Iteration
For p>1
predictors, initial values for the \gamma_{j}
parameters (that determine the structure of the \theta_{k}
parameters) are estimated using an extension of the smart starting algorithm described in Helwig (2013) and Helwig and Ma (2015).
Default use of this function (skip.iter=TRUE
) fixes the \gamma_{j}
parameters after the smart start, and then finds the global smoothing parameter \lambda
(among the input lambdas
) that minimizes the GCV score. This approach typically produces a solution very similar to the more optimal solution using skip.iter=FALSE
.
Setting skip.iter=FALSE
uses the same smart starting algorithm as setting skip.iter=TRUE
. However, instead of fixing the \gamma_{j}
parameters after the smart start, using skip.iter=FALSE
iterates between estimating the optimal \lambda
and the optimal \gamma_{j}
parameters. The R function nlm
is used to minimize the approximate GACV score with respect to the \gamma_{j}
parameters, which can be time consuming for models with many predictors and/or a large number of knots.
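As a usage sketch (predictor names hypothetical), the two settings can be compared directly; skip.iter=FALSE typically takes longer but may achieve a slightly better criterion value:
# fixed gamma_j after the smart start (default)
# fit1 <- bigssg(y~x1*x2,family="poisson",type=list(x1="cub",x2="cub"),nknots=50,skip.iter=TRUE)
# iterative update of gamma_j via nlm (slower, possibly better fit)
# fit2 <- bigssg(y~x1*x2,family="poisson",type=list(x1="cub",x2="cub"),nknots=50,skip.iter=FALSE)
# fit1$info; fit2$info ## compare fit information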
Note
The spline is estimated using penalized likelihood estimation. Standard errors of the linear predictors are formed using Bayesian confidence intervals.
Author(s)
Nathaniel E. Helwig <helwig@umn.edu>
References
Gu, C. (2013). Smoothing spline ANOVA models, 2nd edition. New York: Springer.
Gu, C. and Xiang, D. (2001). Cross-validating non-Gaussian data: Generalized approximate cross-validation revisited. Journal of Computational and Graphical Statistics, 10, 581-591.
Helwig, N. E. (2017). Regression with ordered predictors via ordinal smoothing splines. Frontiers in Applied Mathematics and Statistics, 3(15), 1-13.
Helwig, N. E. and Ma, P. (2015). Fast and stable multiple smoothing parameter selection in smoothing spline analysis of variance models with large samples. Journal of Computational and Graphical Statistics, 24, 715-732.
Helwig, N. E. and Ma, P. (2016). Smoothing spline ANOVA for super-large samples: Scalable computation via rounding parameters. Statistics and Its Interface, 9, 433-444.
Examples
########## EXAMPLE 1 (1-way GSSA) ##########
# define univariate function and data
set.seed(1)
myfun <- function(x){ sin(2*pi*x) }
ndpts <- 1000
x <- runif(ndpts)
# binomial response (no weights)
set.seed(773)
lp <- myfun(x)
p <- 1/(1+exp(-lp))
y <- rbinom(n=ndpts,size=1,p=p) ## y is binary data
gmod <- bigssg(y~x,family="binomial",type="cub",nknots=20)
crossprod( lp - gmod$linear.predictor )/length(lp)
# binomial response (with weights)
set.seed(773)
lp <- myfun(x)
p <- 1/(1+exp(-lp))
w <- sample(c(10,20,30,40,50),length(p),replace=TRUE)
y <- rbinom(n=ndpts,size=w,p=p)/w ## y is proportion correct
gmod <- bigssg(y~x,family="binomial",type="cub",nknots=20,weights=w)
crossprod( lp - gmod$linear.predictor )/length(lp)
# poisson response
set.seed(773)
lp <- myfun(x)
mu <- exp(lp)
y <- rpois(n=ndpts,lambda=mu)
gmod <- bigssg(y~x,family="poisson",type="cub",nknots=20)
crossprod( lp - gmod$linear.predictor )/length(lp)
# Gamma response
set.seed(773)
lp <- myfun(x) + 2
mu <- 1/lp
y <- rgamma(n=ndpts,shape=4,scale=mu/4)
gmod <- bigssg(y~x,family="Gamma",type="cub",nknots=20)
1/gmod$dispersion ## dispersion = 1/shape
crossprod( lp - gmod$linear.predictor )/length(lp)
# inverse gaussian response (not run: requires statmod package)
# require(statmod)
# set.seed(773)
# lp <- myfun(x) + 2
# mu <- sqrt(1/lp)
# y <- rinvgauss(n=ndpts,mean=mu,shape=2)
# gmod <- bigssg(y~x,family="inverse.gaussian",type="cub",nknots=20)
# 1/gmod$dispersion ## dispersion = 1/shape
# crossprod( lp - gmod$linear.predictor )/length(lp)
# negative binomial response (known dispersion)
set.seed(773)
lp <- myfun(x)
mu <- exp(lp)
y <- rnbinom(n=ndpts,size=.5,mu=mu)
gmod <- bigssg(y~x,family=list("negbin",2),type="cub",nknots=20)
1/gmod$dispersion ## dispersion = 1/size
crossprod( lp - gmod$linear.predictor )/length(lp)
# negative binomial response (unknown dispersion)
set.seed(773)
lp <- myfun(x)
mu <- exp(lp)
y <- rnbinom(n=ndpts,size=.5,mu=mu)
gmod <- bigssg(y~x,family="negbin",type="cub",nknots=20)
1/gmod$dispersion ## dispersion = 1/size
crossprod( lp - gmod$linear.predictor )/length(lp)
## Not run:
########## EXAMPLE 2 (2-way GSSA) ##########
# function with two continuous predictors
set.seed(1)
myfun <- function(x1v,x2v){
sin(2*pi*x1v) + log(x2v+.1) + cos(pi*(x1v-x2v))
}
ndpts <- 1000
x1v <- runif(ndpts)
x2v <- runif(ndpts)
# binomial response (no weights)
set.seed(773)
lp <- myfun(x1v,x2v)
p <- 1/(1+exp(-lp))
y <- rbinom(n=ndpts,size=1,p=p) ## y is binary data
gmod <- bigssg(y~x1v*x2v,family="binomial",type=list(x1v="cub",x2v="cub"),nknots=50)
crossprod( lp - gmod$linear.predictor )/length(lp)
# binomial response (with weights)
set.seed(773)
lp <- myfun(x1v,x2v)
p <- 1/(1+exp(-lp))
w <- sample(c(10,20,30,40,50),length(p),replace=TRUE)
y <- rbinom(n=ndpts,size=w,p=p)/w ## y is proportion correct
gmod <- bigssg(y~x1v*x2v,family="binomial",type=list(x1v="cub",x2v="cub"),nknots=50,weights=w)
crossprod( lp - gmod$linear.predictor )/length(lp)
# poisson response
set.seed(773)
lp <- myfun(x1v,x2v)
mu <- exp(lp)
y <- rpois(n=ndpts,lambda=mu)
gmod <- bigssg(y~x1v*x2v,family="poisson",type=list(x1v="cub",x2v="cub"),nknots=50)
crossprod( lp - gmod$linear.predictor )/length(lp)
# Gamma response
set.seed(773)
lp <- myfun(x1v,x2v)+6
mu <- 1/lp
y <- rgamma(n=ndpts,shape=4,scale=mu/4)
gmod <- bigssg(y~x1v*x2v,family="Gamma",type=list(x1v="cub",x2v="cub"),nknots=50)
1/gmod$dispersion ## dispersion = 1/shape
crossprod( lp - gmod$linear.predictor )/length(lp)
# inverse gaussian response (not run: requires 'statmod' package)
# require(statmod)
# set.seed(773)
# lp <- myfun(x1v,x2v)+6
# mu <- sqrt(1/lp)
# y <- rinvgauss(n=ndpts,mean=mu,shape=2)
# gmod <- bigssg(y~x1v*x2v,family="inverse.gaussian",type=list(x1v="cub",x2v="cub"),nknots=50)
# 1/gmod$dispersion ## dispersion = 1/shape
# crossprod( lp - gmod$linear.predictor )/length(lp)
# negative binomial response (known dispersion)
set.seed(773)
lp <- myfun(x1v,x2v)
mu <- exp(lp)
y <- rnbinom(n=ndpts,size=.5,mu=mu)
gmod <- bigssg(y~x1v*x2v,family=list("negbin",2),type=list(x1v="cub",x2v="cub"),nknots=50)
1/gmod$dispersion ## dispersion = 1/size
crossprod( lp - gmod$linear.predictor )/length(lp)
# negative binomial response (unknown dispersion)
set.seed(773)
lp <- myfun(x1v,x2v)
mu <- exp(lp)
y <- rnbinom(n=ndpts,size=.5,mu=mu)
gmod <- bigssg(y~x1v*x2v,family="negbin",type=list(x1v="cub",x2v="cub"),nknots=50)
1/gmod$dispersion ## dispersion = 1/size
crossprod( lp - gmod$linear.predictor )/length(lp)
## End(Not run)