R: Bayesian non-parametric dependent Gaussian process model for...

gpdpgrow {growfunctions}

R Documentation

Bayesian non-parametric dependent Gaussian process model for time-indexed functional data

Description

Estimates a collection of time-indexed functions with Gaussian process (GP) formulations where a Dirichlet process mixture allows sub-groupings of the functions to share the same GP covariance parameters. The GP formulation supports any number of additive GP covariance terms, expressing either or both of multiple trend and seasonality.

Usage

gpdpgrow(
  y,
  ipr,
  time_points,
  gp_cov,
  sn_order,
  jitter,
  gp_shape,
  gp_rate,
  noise_shape,
  noise_rate,
  dp_shape,
  dp_rate,
  M_init,
  lower,
  upper,
  sub_size,
  w_star,
  w,
  n.iter,
  n.burn,
  n.thin,
  n.tune,
  progress,
  b_move,
  cluster,
  s
)

Arguments

`y`	A multivariate continuous response, specified as an N x T matrix, where `N` denotes the number of functions and `T`, the number of time points per function. Intermittent missing-at-random values are allowed and will be estimated from the posterior predictive distribution. Missing cells should be denoted with `NA`. The sampling of missing values requires co-sampling the functions, `bb`, while these functions are marginalized out for sampling under the case of no missing data. So the sampler will run more slowly in the case of intermittent missingness.
`ipr`	An optional input vector of inclusion probabilities for each observation unit in the case the observed data were acquired through an informative sampling design, so that unbiased inference about the population requires adjustments to the observed sample. Defaults to `ipr = rep(1,nrow(y))` indicating an iid sample.
`time_points`	Inputs a vector of common time points at which the collections of functions were observed (with the possibility of intermittent missingness). The length of `time_points` should be equal to the number of columns in the data matrix, `y`. Defaults to `time_points = 1:ncol(y)`.
`gp_cov`	A vector of length `L` to select the covariance function for each of `L` terms. Allowed inputs are `c("rq","se","sn")`, where "rq" denotes the rational quadratic covariance function, "se", the square exponential, and "sn", seasonality. The seasonality covariance is a quasi-periodic multiplicative combination of a fixed length- scale periodic covariance kernel (where the period length is specified in 'sn_order') and a squared exponential kernel (of varying length-scale) to allow the periodicity in the seasonal term to evolve. e.g. If 4 terms are wanted - 2 trend terms with "se" and 2 seasonality terms, the input would be `gp_cov = c("se","se","sn","sn")`. Defaults to `gp_cov = "rq"`.
`sn_order`	A vector of length `L_s`, the number of terms in `gp_cov` where `"sn"` is selected, that denotes the seasonality order for each term; e.g. if the two "sn" terms above are for 3 and 12 month seasonality, respectively, for monthly data, then `sn_order = c(3,12)`. Defaults to `sn_order = NULL`.
`jitter`	A scalar numerical value added to the diagonal elements of the T x T GP covariance matrix to stabilize computation. Defaults to `jitter = 0.01`.
`gp_shape`	The shape parameter of the Gamma base distribution for the DP prior on the P x N matrix of GP covariance parameters (where P denotes the number of parameters for each of the N experimental units). Defaults to `gp_shape = 1`
`gp_rate`	The rate parameter of the Gamma base distribution on GP covariance parameters. Defaults to `gp_rate = 1`.
`noise_shape`	The shape parameter of the Gamma base distribution on `tau_e`, the model noise precision parameter. Defaults to `noise_shape = 3`.
`noise_rate`	The rate parameter of the Gamma base distribution on `tau_e`, the model noise precision parameter. Defaults to `noise_rate = 1`.
`dp_shape`	The shape parameter for the Gamma prior on the DP concentration parameter, `conc`. Defaults to `dp_shape = 1`.
`dp_rate`	The rate parameter for the Gamma prior on the DP concentration parameter, `conc`. Defaults to `dp_rate = 1`.
`M_init`	Starting number of clusters of `nrow(y)` units to initialize sampler. Defaults to `M_init = nrow(y)`.
`lower`	The lower end of the range to be used in conditionally sampling the GP covariance parameters (`kappa,tau_e`) in the slice sampler. Defaults to `lower = 0`.
`upper`	The upper end of the range to be used in conditionally sampling the GP covariance parameters (`kappa,tau_e`) in the slice sampler. Defaults to `upper = 1e10`.
`sub_size`	Integer vector whose length, `n`, equals the number of progressively coarser GP covariance matrices to use for tempered sampling steps in an alternative space to sample the GP covariance parameters. Each entry denotes the number of sub-sample time points in `T` to draw under a latin hypercube design from the N x T data matrix, `y` to employ for that distribution; for example, suppose `T = 300`. `sub_size = c(100,50)`, would randomly select first `100` and then `50` time points from `y` to use for each of the `n = 2` distributions. Defaults to `sub_size = c(floor(0.25T),floor(0.1T))`.
`w_star`	Integer value denoting the number of cluster locations to sample ahead of observations in the auxiliary Gibbs sampler used to sample the number of clusters and associated cluster assignments. A higher value reduces samplin auto-correlation, but increases computational burden. Defaults to `w_star = 2`.
`w`	Numeric value denoting the step width used to construct the interval from which to draw a sample for each GP covariance parameter in the slice sampler. This value is adaptively updated in the sampler tuning stage for each parameter to be equal to the difference in the 0.95 and 0.05 sample quantiles for each of 5 block updates. Defaults to `w = 1.5`.
`n.iter`	Total number of MCMC iterations.
`n.burn`	Number of MCMC iterations to discard. `gpdpgrow` will return `(n.iter - n.burn)` posterior samples.
`n.thin`	Gap between successive sampling iterations to save.
`n.tune`	Number of iterations (before ergodic chain instantiated) to adapt `w`, separately, for each covariance term, `p = 1,...,P`. Sets each `w_p` to lie in the 90 percent credible interval computed from the tuning sample (that is divided into 5 blocks so that `w_p` is successively updated in each block of runs).
`progress`	A boolean value denoting whether to display a progress bar during model execution. Defaults to `progress = TRUE`
`b_move`	A boolean value denoting whether to sample the GP function, `bb`, in T x 1 Gibbs steps `b_move = TRUE` or through elliptical slice sampling. Defaults to `b_move = TRUE`. Only used in the case there is any intermittent missingness; otherwise `bb` is marginalized out of the sampler and post-sampled from the its predictive distribution.
`cluster`	A boolean value denoting whether to employ DP mix model over set of GP functions or to just use GP model with no clustering of covariance function parameters. Defaults to `cluster = TRUE`
`s`	An N x 1 integer vector that inputs a fixed clustering, rather than sampling it. Defaults to `s = NULL`

Value

S3 gpdpgrow object, for which many methods are available to return and view results. Generic functions applied to an object, res of class gpdpgrow, includes:

`samples(res)`	contains (`n.iter - n.burn`) posterior sampling iterations for every model parameter
`resid(res)`	contains the model residuals.

Note

The intended focus for this package are data composed of observed noisy functions (each of length T) for a set of experimental units where the functions may express dependence among the experimental units

Author(s)

Terrance Savitsky tds151@gmail.com Daniell Toth danielltoth@yahoo.com

References

T. D. Savitsky and D. Toth (2014) Bayesian Non-parametric Models for Collections of Time- indexed Functions. submitted to: JRSS Series A (Statistics in Society).

T. D. Savitsky (2014) Bayesian Non-parametric Functional Mixture Estimation for Time-indexed data. submitted to: Annals of Applied Statistics.

T. D. Savitsky (2014) Bayesian Non-Parametric Mixture Estimation for Time-Indexed Functional Data for R. Submitted to: Journal of Statistical Software.

Examples

{
library(growfunctions)

## load the monthly employment count data for a collection of 
## U.S. states from the Current 
## Population Survey (cps)
data(cps)
## subselect the columns of N x T, y, associated with 
## the years 2011 - 2013
## to examine the state level employment 
## levels during the "great recession"
y_short   <- cps$y[,(cps$yr_label %in% c(2011:2013))]

## uses default setting of a single "rational quadratic" covariance
## run for 500 iterations, with half discarded as burn-in to 
## obtain a more useful result.
res_gp               <- gpdpgrow(y = y_short, 
                                 n.iter = 4, 
                                 n.burn = 1, 
                                 n.thin = 1, 
                                 n.tune = 0)  

## Two plots of estimated functions, 
## 1. faceted by cluster 
## 2. fitted functions vs noisy observations
## first plot will plot estimated denoised function, 
## bb_i, for a single (randomly-selected) "state"
fit_plots_gp        <- cluster_plot( object = res_gp,  
                           units_name = "state", 
                           units_label = cps$st, 
                           single_unit = TRUE, 
                           credible = TRUE )
## second plot will randomly select 6 states 
## and plot their estimated denoised functions, bb_i.
## with setting "single_unit = FALSE". 
## (Option "num_plot" may be set to plot 
## any integer number of 
## randomly-selected units.)
fit_plots_gp        <- cluster_plot( object = res_gp,  
                                     units_name = "state", 
                                     units_label = cps$st, 
                                     single_unit = FALSE, 
                                     credible = TRUE )

}

[Package growfunctions version 0.16 Index]