growfunctions-package {growfunctions} | R Documentation |
Bayesian Non-Parametric Models for Estimating a Set of Denoised, Latent Functions From an Observed Collection of Domain-Indexed Time-Series
Description
Package: | growfunctions |
Type: | Package |
Version: | 0.16 |
Date: | 2023-12-08 |
License: | GPL (>= 3) |
LazyLoad: | yes |
Details
Parameterizes a model for a collection of noisy time-series indexed by domain or observation unit as an additive function process plus noise process. The latent functions are modeled under both Gaussian process (GP) and intrinsic Gaussian Markov random (iGMRF) field priors. Dependence is estimated among the set of functions by selecting a prior, G, on the covariance or precision parameters of the GP and iGMRF, respectively, where G receives a Dirichlet process (DP) prior. The resulting marginal prior on the functions is a nonparametric scale mixture over the covariance or precision parameters, with G the unknown mixing measure. Draws from a DP are almost surely discrete, allowing for ties (interpreted as clusters) among the covariance or precision parameters indexed by domain or observation unit. Functions are included that permit additional inference on the clustering properties over the domains. The mixture models allow specification of multiple additive latent functions for each of the GP and iGMRF DP mixture formulations. Each function (under a GP prior) may be specified with a covariance function selected from a set of available options that allow for combinations of trend and seasonal components. The same is true for precision matrix constructions under the iGMRF DP mixture.
ESTIMATION FUNCTIONS
gpdpgrow
performs Bayesian nonparametric estimation of a set of dependent, denoised
latent functions under a DP mixture of GP's in an unsupervised fashion based on user input of
an N x T
data matrix, y
, where N
denotes the number of domains or units
and T
denotes the number of time points. The DP prior estimates the dependent structure
among the covariance parameters generating each T x 1
function across the N
domains.
The user may specify multiple latent functions in an additive
regression formulation, where
each covariance kernel may be selected from the squared exponential ("se"
), the
rational quadratic ("rq"
), or a quasi-periodic (product of a period and
squared exponential
to allow the periodic kernel to evolve) ("sn"
).
gmrfdpgrow
also inputs an N x T
data matrix, y
, but replaces the
GP prior used in gpdpgrow()
with an iGMRF prior. The DP mixture is over the precision
parameter, kappa
, that multiplies a fixed matrix, Q
, which specifies
the length-scale
of dependence. The user may specify multiple functions in an additive formulation,
each with its own precision matrix. The precision matrix is specified
via two options. Input q_type
indicates whether the precision matrix is a trend ("tr"
)
precision matrix or a
seasonal ("sn"
) precision matrix. Integer innput q_order
controls the length scale of the estimated functions and specifies the
difference order of the precision matrix in the case of q_type = "tr"
,
or the periodicity, in the case q_type = "sn"
. Typically, q_order
is set to
1
or 2
when q_type = "tr"
, though other values are possible. An optional
N x R
predictor matrix, ksi, may be input to adjust the prior for cluster assignments,
s
, to produce a dependent product partition model.
MSPE
Inputs a gpdpgrow()
or gmrfdpgrow()
object estimated where
some data values deliberately set to missing (NA
) and produces an out-of-sample
mean square prediction error (MSPE), and a normalized MSPE (normalized by the variance of
the missing test set). Both gpdpgrow()
and gmrfdpgrow()
returned objects also
include a leave-one-out log-pseudo marginal likelihood LPML
fit statistic estimated
on the training set (so that no data are required to be left out). One would expect the
nMSPE
statistic to provide a greater penalty for model complexity since it is computed
on data not used for estimation.
predict_functions
Uses the model-estimated GP covariance parameters from
gpdpgrow()
or iGMRF precision parameters from gmrfdpgrow()
to predict functions at future time points beyond the range of the data
from their posterior predictive distributions. Both gpdpgrow()
and gmrfdpgrow()
will predict missing function values in the range of the data.
PLOT FUNCTIONS
cluster_plot
inputs a returned object from either of gpdpgrow()
or
gmrfdpgrow()
and produces two plots. The first plot creates panels for the clusters
and aggregates line plots of posterior mean estimates for member denoised functions.
An optional
smoother may be drawn through the set of functions in each panel to differentiate the
patterns expressed across clusters. A second plot renders the estimated posterior mean
values (with an option for credible intervals) for a single or group of randomly-selected
latent functions against the actual data values, y
.
informative_plot
inputs a list of returned objects, all from either
of gpdpgrow()
or gmrfdpgrow()
(where the model type, GP or
iGMRF, is communicated with input model
) that compares
credible intervals for covariance or precision
parameters where the data are drawn from an informative
sampling design (rather than as iid), so that the distribution
for the population is not the same as that for
the sample. One model conducts estimation in a fashion that ignores the informative
design and the other incorporates sampling weights to account for the informativeness.
Comparing the resulting estimated credible intervals provides a means to assess the
sensitivity of estimated parameters to the sampling design. The set of objects are
labeled with objects_labels
with allowable inputs c("ignore","weight","iid")
.
One of the objects must have label "ignore"
, and the other must have label,
weight
. An additional object may also be input that is estimated from an iid sample
drawn from the same population as the informative sample used for estimation under
objects "ignore"
and "weight"
. Both informative and iid samples are generated
from the synthetic data function, gen_informative_sample()
.
fit_compare
inputs a list of returned objects, each from either
gpdpgrow()
or gmrfdpgrow()
, and plots the posterior mean estimate
for a randomly-selected latent function as compared to the actual data in a set
panels indexed by cluster and object. Allows comparison of fit performance of
functions to data across varied model specifications.
predict_plot
uses a returned object from predict_functions()
to plot both estimated and predicted function values (with the option for
credible intervals) where the prediction interval is outside the range of data.
The plot function, cluster_plot
, should be used where
the predicted values are in the range
of the data (and may, hence, be treated as missing values. The estimated functions are
plotted in cluster_plot
for the missing, as well as observed, data points).
DATA SETS (and functions to generate synthetic data sets)
cps
Data derived from the Current Population Survey (CPS) administered
to households in
local areas within each of 51 states. These data capture a rectangular matrix of
monthly-indexed
all-other employment direct estimates computed from the set of household responses.
The data capture
156 month observations (from 2000 - 2013) for each of the 51 states. Two objects are included:
y_raw
contains the 51 x 156 matrix of all-other employment counts.
y
contains the 51 x 156 matrix of all-other
employment counts are standardization to (0,1), by state, to all
them to be modeled together.
gen_informative_sample
Generates an N x T
population data matrix,
y
, and an associated n x T
sample data matrix, y_obs
, where the
sample is drawn using a 1 or 2-stage informative process. The 1-stage sample uses
unequal probability stratified sampling, while the 2-stage process samples the first stage
in blocks, while the second stage samples with unequal probability from strata for selected
blocks. Both block and strata memberships for population units are generated based on the
variance of their time-series, y
. The resulting sample is informative because the
block and cluster memberships and selection probabilities are determined based on y
.
Author(s)
Terrance Savitsky tds151@gmail.com
References
Terrance D. Savitsky (2016) Bayesian Nonparametric Mixture Estimation for Time-Indexed Functional Data in R, Journal of Statistical Software, Volume 72, Number 2, pages 1 – 34, doi:10.18637/jss.v072.i02.
T. D. Savitsky, D. Toth (2016) Bayesian Estimation Under Informative Sampling, Electronic Journal of Statistics, Volume 10, Number 1.
T. D. Savitsky (2016) Bayesian Non-parametric Functional Mixture Estimation for Time-indexed data. submitted to: Survey Methodology.
Examples
{
library(growfunctions)
## load the monthly employment count
## data for a collection of
## U.S. states from the Current
## Population Survey (cps)
data(cps)
## subselect the columns of N x T, y,
## associated with
## the years 2011 - 2013
## to examine the state level
## employment levels
## during the "great recession"
y_short <- cps$y[,(cps$yr_label %in%
c(2011:2013))]
## run DP mixture of GP's to estimate
## posterior distributions
## for model parameters
## uses default setting of a single
## "rational quadratic"
## covariance formula
## A short number of iterations is used
## for illustration
## Run for 500 iterations with half
## as burn-in to
## get a more useful result
res_gp <- gpdpgrow(
y = y_short,
n.iter = 3,
n.burn = 1,
n.thin = 1,
n.tune = 0)
## 2 plots of estimated functions:
## 1. faceted by cluster and fit;
## 2. data for experimental units.
## for a group of randomly-selected
## functions
fit_plots_gp <- cluster_plot(
object = res_gp, units_name = "state",
units_label = cps$st, single_unit = FALSE,
credible = TRUE )
## Run the DP mixture of iGMRF's
## to estimate posterior
## distributions for model parameters
## Under default RW2(kappa) = order 2 trend
## precision term
## A short number of iterations
## is used for illustration
## Run for 2000 iterations with
## half as burn-in to
## get a more useful result
res_gmrf <- gmrfdpgrow(
y = y_short,
n.iter = 11,
n.burn = 4,
n.thin = 1)
## 2 plots of estimated functions:
## 1. faceted by cluster and fit;
## 2. data for experimental units.
## for a group of
## randomly-selected functions
fit_plots_gmrf <- cluster_plot(
object = res_gmrf, units_name = "state",
units_label = cps$st, single_unit = FALSE,
credible = TRUE )
## visual comparison of fit performance
## between gpdpgrow() and gmrfdpgrow()
## or any two objects returned from any
## combination of these estimation
## functions
objects <- vector("list",2)
objects[[1]] <- res_gmrf
objects[[2]] <- res_gp
label.object <- c("gmrf_tr2","gp_rq")
## the map data.frame
## object from fit_plots gp
## includes a field that
## identifies cluster assignments
## for each unit (or domain)
H <- fit_plots_gp$map$cluster
fit_plot_compare_facet <-
fit_compare( objects = objects,
H = H, label.object = label.object,
y.axis.label = "normalized y",
units_name = "state", units_label = cps$st)
}