R: Bayesian Non-Parametric Models for Estimating a Set of...

growfunctions-package {growfunctions}

R Documentation

Bayesian Non-Parametric Models for Estimating a Set of Denoised, Latent Functions From an Observed Collection of Domain-Indexed Time-Series

Description

Package:	growfunctions
Type:	Package
Version:	0.16
Date:	2023-12-08
License:	GPL (>= 3)
LazyLoad:	yes

Details

Parameterizes a model for a collection of noisy time-series indexed by domain or observation unit as an additive function process plus noise process. The latent functions are modeled under both Gaussian process (GP) and intrinsic Gaussian Markov random (iGMRF) field priors. Dependence is estimated among the set of functions by selecting a prior, G, on the covariance or precision parameters of the GP and iGMRF, respectively, where G receives a Dirichlet process (DP) prior. The resulting marginal prior on the functions is a nonparametric scale mixture over the covariance or precision parameters, with G the unknown mixing measure. Draws from a DP are almost surely discrete, allowing for ties (interpreted as clusters) among the covariance or precision parameters indexed by domain or observation unit. Functions are included that permit additional inference on the clustering properties over the domains. The mixture models allow specification of multiple additive latent functions for each of the GP and iGMRF DP mixture formulations. Each function (under a GP prior) may be specified with a covariance function selected from a set of available options that allow for combinations of trend and seasonal components. The same is true for precision matrix constructions under the iGMRF DP mixture.

ESTIMATION FUNCTIONS

gpdpgrow performs Bayesian nonparametric estimation of a set of dependent, denoised latent functions under a DP mixture of GP's in an unsupervised fashion based on user input of an N x T data matrix, y, where N denotes the number of domains or units and T denotes the number of time points. The DP prior estimates the dependent structure among the covariance parameters generating each T x 1 function across the N domains. The user may specify multiple latent functions in an additive regression formulation, where each covariance kernel may be selected from the squared exponential ("se"), the rational quadratic ("rq"), or a quasi-periodic (product of a period and squared exponential to allow the periodic kernel to evolve) ("sn").

gmrfdpgrow also inputs an N x T data matrix, y, but replaces the GP prior used in gpdpgrow() with an iGMRF prior. The DP mixture is over the precision parameter, kappa, that multiplies a fixed matrix, Q, which specifies the length-scale of dependence. The user may specify multiple functions in an additive formulation, each with its own precision matrix. The precision matrix is specified via two options. Input q_type indicates whether the precision matrix is a trend ("tr") precision matrix or a seasonal ("sn") precision matrix. Integer innput q_order controls the length scale of the estimated functions and specifies the difference order of the precision matrix in the case of q_type = "tr", or the periodicity, in the case q_type = "sn". Typically, q_order is set to 1 or 2 when q_type = "tr", though other values are possible. An optional N x R predictor matrix, ksi, may be input to adjust the prior for cluster assignments, s, to produce a dependent product partition model.

MSPE Inputs a gpdpgrow() or gmrfdpgrow() object estimated where some data values deliberately set to missing (NA) and produces an out-of-sample mean square prediction error (MSPE), and a normalized MSPE (normalized by the variance of the missing test set). Both gpdpgrow() and gmrfdpgrow() returned objects also include a leave-one-out log-pseudo marginal likelihood LPML fit statistic estimated on the training set (so that no data are required to be left out). One would expect the nMSPE statistic to provide a greater penalty for model complexity since it is computed on data not used for estimation.

predict_functions Uses the model-estimated GP covariance parameters from gpdpgrow() or iGMRF precision parameters from gmrfdpgrow() to predict functions at future time points beyond the range of the data from their posterior predictive distributions. Both gpdpgrow() and gmrfdpgrow() will predict missing function values in the range of the data.

PLOT FUNCTIONS

cluster_plot inputs a returned object from either of gpdpgrow() or gmrfdpgrow() and produces two plots. The first plot creates panels for the clusters and aggregates line plots of posterior mean estimates for member denoised functions. An optional smoother may be drawn through the set of functions in each panel to differentiate the patterns expressed across clusters. A second plot renders the estimated posterior mean values (with an option for credible intervals) for a single or group of randomly-selected latent functions against the actual data values, y.

informative_plot inputs a list of returned objects, all from either of gpdpgrow() or gmrfdpgrow() (where the model type, GP or iGMRF, is communicated with input model) that compares credible intervals for covariance or precision parameters where the data are drawn from an informative sampling design (rather than as iid), so that the distribution for the population is not the same as that for the sample. One model conducts estimation in a fashion that ignores the informative design and the other incorporates sampling weights to account for the informativeness. Comparing the resulting estimated credible intervals provides a means to assess the sensitivity of estimated parameters to the sampling design. The set of objects are labeled with objects_labels with allowable inputs c("ignore","weight","iid"). One of the objects must have label "ignore", and the other must have label, weight. An additional object may also be input that is estimated from an iid sample drawn from the same population as the informative sample used for estimation under objects "ignore" and "weight". Both informative and iid samples are generated from the synthetic data function, gen_informative_sample().

fit_compare inputs a list of returned objects, each from either gpdpgrow() or gmrfdpgrow(), and plots the posterior mean estimate for a randomly-selected latent function as compared to the actual data in a set panels indexed by cluster and object. Allows comparison of fit performance of functions to data across varied model specifications.

predict_plot uses a returned object from predict_functions() to plot both estimated and predicted function values (with the option for credible intervals) where the prediction interval is outside the range of data. The plot function, cluster_plot, should be used where the predicted values are in the range of the data (and may, hence, be treated as missing values. The estimated functions are plotted in cluster_plot for the missing, as well as observed, data points).

DATA SETS (and functions to generate synthetic data sets)

cps Data derived from the Current Population Survey (CPS) administered to households in local areas within each of 51 states. These data capture a rectangular matrix of monthly-indexed all-other employment direct estimates computed from the set of household responses. The data capture 156 month observations (from 2000 - 2013) for each of the 51 states. Two objects are included: y_raw contains the 51 x 156 matrix of all-other employment counts. y contains the 51 x 156 matrix of all-other employment counts are standardization to (0,1), by state, to all them to be modeled together.

gen_informative_sample Generates an N x T population data matrix, y, and an associated n x T sample data matrix, y_obs, where the sample is drawn using a 1 or 2-stage informative process. The 1-stage sample uses unequal probability stratified sampling, while the 2-stage process samples the first stage in blocks, while the second stage samples with unequal probability from strata for selected blocks. Both block and strata memberships for population units are generated based on the variance of their time-series, y. The resulting sample is informative because the block and cluster memberships and selection probabilities are determined based on y.

Author(s)

Terrance Savitsky tds151@gmail.com

References

Terrance D. Savitsky (2016) Bayesian Nonparametric Mixture Estimation for Time-Indexed Functional Data in R, Journal of Statistical Software, Volume 72, Number 2, pages 1 – 34, doi:10.18637/jss.v072.i02.

T. D. Savitsky, D. Toth (2016) Bayesian Estimation Under Informative Sampling, Electronic Journal of Statistics, Volume 10, Number 1.

T. D. Savitsky (2016) Bayesian Non-parametric Functional Mixture Estimation for Time-indexed data. submitted to: Survey Methodology.

Examples

{
library(growfunctions)

## load the monthly employment count 
## data for a collection of 
## U.S. states from the Current 
## Population Survey (cps)
data(cps)
## subselect the columns of N x T, y, 
## associated with 
## the years 2011 - 2013
## to examine the state level 
## employment levels 
## during the "great recession"
y_short <- cps$y[,(cps$yr_label %in% 
                  c(2011:2013))]

## run DP mixture of GP's to estimate 
## posterior distributions 
## for model parameters
## uses default setting of a single 
## "rational quadratic" 
## covariance formula
## A short number of iterations is used 
## for illustration
## Run for 500 iterations with half 
## as burn-in to 
## get a more useful result
res_gp     <- gpdpgrow(
                   y = y_short, 
                   n.iter = 3, 
                   n.burn = 1, 
                   n.thin = 1, 
                   n.tune = 0)  
## 2 plots of estimated functions: 
## 1. faceted by cluster and fit;
## 2.  data for experimental units.
## for a group of randomly-selected 
## functions
fit_plots_gp   <- cluster_plot( 
  object = res_gp, units_name = "state", 
  units_label = cps$st, single_unit = FALSE, 
  credible = TRUE )
                                   
## Run the DP mixture of iGMRF's 
## to estimate posterior 
## distributions for model parameters
## Under default RW2(kappa) = order 2 trend 
## precision term
## A short number of iterations 
## is used for illustration
## Run for 2000 iterations with 
## half as burn-in to 
## get a more useful result
res_gmrf <- gmrfdpgrow(
                 y = y_short, 
                 n.iter = 11, 
                 n.burn = 4, 
                 n.thin = 1) 
                                     
## 2 plots of estimated functions: 
## 1. faceted by cluster and fit;
## 2.  data for experimental units.
## for a group of 
## randomly-selected functions
fit_plots_gmrf <- cluster_plot( 
  object = res_gmrf, units_name = "state", 
  units_label = cps$st, single_unit = FALSE, 
  credible = TRUE )                                    
                                     
## visual comparison of fit performance 
## between gpdpgrow() and gmrfdpgrow()
## or any two objects returned from any
## combination of these estimation
## functions
objects        <- vector("list",2)
objects[[1]]   <- res_gmrf
objects[[2]]   <- res_gp
label.object   <- c("gmrf_tr2","gp_rq")
## the map data.frame 
## object from fit_plots gp 
## includes a field that 
## identifies cluster assignments
## for each unit (or domain)
H        <- fit_plots_gp$map$cluster
fit_plot_compare_facet <- 
fit_compare( objects = objects, 
 H = H, label.object = label.object,
 y.axis.label = "normalized y",
 units_name = "state", units_label = cps$st)                                
}

[Package growfunctions version 0.16 Index]