| ssden {gss} | R Documentation | 
Estimating Probability Density Using Smoothing Splines
Description
Estimate probability densities using smoothing spline ANOVA models.
The symbolic model specification via formula follows the same
rules as in lm, but with the response missing.
Usage
ssden(formula, type=NULL, data=list(), alpha=1.4, weights=NULL,
      subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL,
      domain=as.list(NULL), quad=NULL, qdsz.depth=NULL, bias=NULL,
      prec=1e-7, maxiter=30, skip.iter=FALSE)
ssden1(formula, type=NULL, data=list(), alpha=1.4, weights=NULL,
       subset, na.action=na.omit, id.basis=NULL, nbasis=NULL, seed=NULL,
       domain=as.list(NULL), quad=NULL, prec=1e-7, maxiter=30)
Arguments
| formula | Symbolic description of the model to be fit. | 
| type | List specifying the type of spline for each variable.
See  | 
| data | Optional data frame containing the variables in the model. | 
| alpha | Parameter defining cross-validation score for smoothing parameter selection. | 
| weights | Optional vector of bin-counts for histogram data. | 
| subset | Optional vector specifying a subset of observations to be used in the fitting process. | 
| na.action | Function which indicates what should happen when the data contain NAs. | 
| id.basis | Index of observations to be used as "knots." | 
| nbasis | Number of "knots" to be used.  Ignored when
 | 
| seed | Seed to be used for the random generation of "knots."
Ignored when  | 
| domain | Data frame specifying marginal support of density. | 
| quad | Quadrature for calculating integral. Mandatory if variables other than factors or numerical vectors are involved. | 
| qdsz.depth | Depth to be used in  | 
| bias | Input for sampling bias. | 
| prec | Precision requirement for internal iterations. | 
| maxiter | Maximum number of iterations allowed for internal iterations. | 
| skip.iter | Flag indicating whether to use initial values of
theta and skip theta iteration.  See  | 
Details
The model specification via formula is for the log density.
For example, ~x1*x2 prescribes a model of the form
	log f(x1,x2) = g_{1}(x1) + g_{2}(x2) + g_{12}(x1,x2) + C
    
with the terms denoted by "x1", "x2", and
"x1:x2"; the constant is determined by the fact that a
density integrates to one.
The selective term elimination may characterize (conditional)
independence structures between variables.  For example,
~x1*x2+x1*x3 yields the conditional independence of x2 and x3
given x1.
Parallel to those in a ssanova object, the model terms
are sums of unpenalized and penalized terms.  Attached to every
penalized term there is a smoothing parameter, and the model
complexity is largely determined by the number of smoothing
parameters.
The selection of smoothing parameters is through a cross-validation
mechanism described in the references, with a parameter
alpha; alpha=1 is "unbiased" for the minimization of
Kullback-Leibler loss but may yield severe undersmoothing, whereas
larger alpha yields smoother estimates.
A subset of the observations are selected as "knots."  Unless
specified via id.basis or nbasis, the number of
"knots" q is determined by max(30,10n^{2/9}), which is
appropriate for the default cubic splines for numerical vectors.
Value
ssden returns a list object of class "ssden".
ssden1 returns a list object of class
c("ssden1","ssden").
dssden and cdssden can be used to
evaluate the estimated joint density and conditional density;
pssden, qssden, cpssden,
and cqssden can be used to evaluate (conditional) cdf
and quantiles.
The method project.ssden can be used to calculate the
Kullback-Leibler projection of "ssden" objects for model
selection; project.ssden1 can be used to calculate the
square error projection of "ssden1" objects.
Note
In ssden, default quadrature will be constructed for
numerical vectors on a hyper cube, then outer product with factor
levels will be taken if factors are involved.  The sides of the
hyper cube are specified by domain; for domain$x
missing, the default is
c(min(x),max(x))+c(-1,1)*(max(x)-mimn(x))*.05.  In 1-D, the
quadrature is the 200-point Gauss-Legendre formula returned from
gauss.quad.  In multi-D, delayed Smolyak cubatures
from smolyak.quad are used on cubes with the marginals
properly transformed; see Gu and Wang (2003) for the marginal
transformations.
For reasonable execution time in higher dimensions, set
skip.iter=TRUE in call to ssden.
If you get an error message from ssden stating "Newton
    iteration diverges", try to use a larger qdsz.depth which
will execute slower, or switch to ssden1.  The default values
of qdsz.depth for dimensions 4, 5, 6+ are 12, 11, 10.
ssden1 does not involve multi-D quadrature but does not
perform as well as ssden.  It can be used in very high
dimensions where ssden is infeasible.
The results may vary from run to run.  For consistency, specify
id.basis or set seed.
Author(s)
Chong Gu, chong@stat.purdue.edu
References
Gu, C. and Wang, J. (2003), Penalized likelihood density estimation: Direct cross-validation and scalable approximation. Statistica Sinica, 13, 811–826.
Gu, C., Jeon, Y., and Lin, Y. (2013), Nonparametric density estimation in high dimensions. Statistica Sinica, 23, 1131–1153.
Gu, C. (2013), Smoothing Spline ANOVA Models (2nd Ed). New York: Springer-Verlag.
Gu, C. (2014), Smoothing Spline ANOVA Models: R Package gss. Journal of Statistical Software, 58(5), 1-25. URL http://www.jstatsoft.org/v58/i05/.
Examples
## 1-D estimate: Buffalo snowfall
data(buffalo)
buff.fit <- ssden(~buffalo,domain=data.frame(buffalo=c(0,150)))
plot(xx<-seq(0,150,len=101),dssden(buff.fit,xx),type="l")
plot(xx,pssden(buff.fit,xx),type="l")
plot(qq<-seq(0,1,len=51),qssden(buff.fit,qq),type="l")
## Clean up
## Not run: rm(buffalo,buff.fit,xx,qq)
dev.off()
## End(Not run)
## 2-D with triangular domain: AIDS incubation
data(aids)
## rectangular quadrature
quad.pt <- expand.grid(incu=((1:40)-.5)/40*100,infe=((1:40)-.5)/40*100)
quad.pt <- quad.pt[quad.pt$incu<=quad.pt$infe,]
quad.wt <- rep(1,nrow(quad.pt))
quad.wt[quad.pt$incu==quad.pt$infe] <- .5
quad.wt <- quad.wt/sum(quad.wt)*5e3
## additive model (pre-truncation independence)
aids.fit <- ssden(~incu+infe,data=aids,subset=age>=60,
                  domain=data.frame(incu=c(0,100),infe=c(0,100)),
                  quad=list(pt=quad.pt,wt=quad.wt))
## conditional (marginal) density of infe
jk <- cdssden(aids.fit,xx<-seq(0,100,len=51),data.frame(incu=50))
plot(xx,jk$pdf,type="l")
## conditional (marginal) quantiles of infe (TIME-CONSUMING)
## Not run: 
cqssden(aids.fit,c(.05,.25,.5,.75,.95),data.frame(incu=50))
## End(Not run)
## Clean up
## Not run: rm(aids,quad.pt,quad.wt,aids.fit,jk,xx)
dev.off()
## End(Not run)
## One factor plus one vector
data(gastric)
gastric$trt
fit <- ssden(~futime*trt,data=gastric)
## conditional density
cdssden(fit,c("1","2"),cond=data.frame(futime=150))
## conditional quantiles
cqssden(fit,c(.05,.25,.5,.75,.95),data.frame(trt=as.factor("1")))
## Clean up
## Not run: rm(gastric,fit)
## Sampling bias
## (X,T) is truncated to T<X<1 for T~U(0,1), so X is length-biased
rbias <- function(n) {
  t <- runif(n)
  x <- rnorm(n,.5,.15)
  ok <- (x>t)&(x<1)
  while(m<-sum(!ok)) {
    t[!ok] <- runif(m)
    x[!ok] <- rnorm(m,.5,.15)
    ok <- (x>t)&(x<1)
  }
  cbind(x,t)
}
xt <- rbias(100)
x <- xt[,1]; t <- xt[,2]
## length-biased
bias1 <- list(t=1,wt=1,fun=function(t,x){x[,]})
fit1 <- ssden(~x,domain=list(x=c(0,1)),bias=bias1)
plot(xx<-seq(0,1,len=101),dssden(fit1,xx),type="l")
## truncated
bias2 <- list(t=t,wt=rep(1/100,100),fun=function(t,x){x[,]>t})
fit2 <- ssden(~x,domain=list(x=c(0,1)),bias=bias2)
plot(xx,dssden(fit2,xx),type="l")
## Clean up
## Not run: rm(rbias,xt,x,t,bias1,fit1,bias2,fit2)