R: Posterior sampling for models based on MGPs

spmeshed {meshed}

R Documentation

Posterior sampling for models based on MGPs

Description

Fits Bayesian multivariate spatial or spatiotemporal regression models with latent MGPs via Markov chain Monte Carlo.

Usage

spmeshed(y, x, coords, k=NULL,
       family = "gaussian",
       axis_partition = NULL, 
       block_size = 30,
       grid_size = NULL,
       grid_custom = NULL,
       n_samples = 1000,
       n_burnin = 100,
       n_thin = 1,
       n_threads = 4,
       verbose = 0,
       predict_everywhere = FALSE,
       settings = list(adapting=TRUE, forced_grid=NULL, 
                          cache=NULL, ps=TRUE, saving=TRUE, low_mem=FALSE, hmc=4),
       prior = list(beta=NULL, tausq=NULL, sigmasq = NULL,
                          phi=NULL, a=NULL, nu = NULL,
                          toplim = NULL, btmlim = NULL, set_unif_bounds=NULL),
       starting = list(beta=NULL, tausq=NULL, theta=NULL, lambda=NULL, v=NULL, 
                       a=NULL, nu = NULL,
                       mcmcsd=.05, 
                       mcmc_startfrom=0),
       debug = list(sample_beta=TRUE, sample_tausq=TRUE, 
                    sample_theta=TRUE, sample_w=TRUE, sample_lambda=TRUE,
                    verbose=FALSE, debug=FALSE),
       indpart=FALSE
)

Arguments

`y`	matrix of multivariate outcomes with `n` rows and `q` columns. Each row of `y` corresponds to a row of `coords`. `NA` values are accepted in any combination and will be predicted via MCMC.
`x`	matrix of covariates with `n` rows and `p` columns.
`coords`	matrix of coordinates with `n` rows and `d=2` or `d=3` columns for spatial or spacetime regression, respectively.
`k`	integer `k\leq q`, number of latent processes to use for the linear model of coregionalization. If unspecified, this is set to `q=ncol(y)`.
`family`	a vector with length `1` or `q` whose elements corresponds to the data types of columns of `y`. Available choices are `gaussian`, `poisson`, `binomial`, `beta` for outcomes that are continuous, count, binary, or `(0,1)` proportions.
`axis_partition`	integer vector of size `d`: number of intervals each coordinate axis is split into
`block_size`	integer approximate size of the blocks after domain partitioning. Only used if `axis_partition` is not specified.
`grid_size`	integer vector of size `d`: number of 'knots' of the reference grid along each axis. This grid is then partitioned using either `axis_partition` or `block_size`. If unspecified, this is set so that the eventual grid size is close to `n`. This parameter is ignored if `settings$forced_grid=FALSE` in which case the data are assumed to be on a grid.
`grid_custom`	list with elements `grid` and `axis_interval_partition`. `grid` is a data.frame with the user supplied grid of knots. It is possible to include covariate values for the grid locations as additional columns, as long as their number matches `ncol(x)` - this is useful to make raster images of predictions. `axis_interval_partition` is the user supplied set of cuts for each coordinate axis (Note: these are the actual cutpoints along the axes, not the number of cuts). If left empty, `axis_partition` will be used to partition the custom grid. No checks are made on the validity of this grid. This parameter is ignored if `settings$forced_grid=FALSE` in which case the data are assumed to be on a grid.
`n_samples`	integer number of MCMC samples at which all the unknowns are stored (including the latent effects).
`n_burnin`	integer number of MCMC samples to discard at the beginning of the chain.
`n_thin`	integer thinning parameter for the MCMC chain. Only the chain of latent effects (`w`) is thinned to save memory in big data problems. Chains for other unknowns are not thinned and thus will be of length `n_thin * n_samples`.
`n_threads`	integer number of OpenMP threads. This is ineffective if `meshed` was not compiled with OpenMP support.
`verbose`	integer. If `verbose<=20`, then this is the number of times a message is displayed during MCMC. If `verbose>20`, then this is the number of MCMC iterations to wait until the next message update. If `verbose=Inf`, then a message will be printed at each MCMC iteration.
`predict_everywhere`	bool used if settings$forced_grid=T. Should predictions be made at the reference grid locations? If not, predictions will be made only at the supplied NA values of Y.
`settings`	list: `settings$adapting` turns the adaptation of MCMC on/off, `settings$forced_grid` determines whether or not to use the data grid or a forced grid; if unspecified, the function will try to see what the data look like. Note: if `forced_grid=FALSE` and `n` is very large and `coords` are irregularly spaced, then expect slowdowns in preprocessing and consider using `forced_grid=TRUE` instead. `settings$saving` will save model data if set to `TRUE`. `settings$low_mem` will only save `beta_mcmc`, `lambda_mcmc`, `v_mcmc`, `tausq_mcmc` (and not `w_mcmc` and `lp_mcmc`, which can be recovered from the others), thereby using less memory. All fitted predictions remain available in `yhat_mcmc` for convenience. `settings$ps` (default `TRUE`) determines whether to use the PS parametrization (Peruzzi et al 2021). `settings$hmc`, used if any outcome is not Gaussian, (1: MALA, 2: NUTS, 3: RM-MALA, 4: Simplified manifold preconditioning (default))
`prior`	list: setup for priors of unknown parameters. `prior$phi` needs to be specified as the support of the Uniform prior for `\phi`. There is currently limited functionality here and some inputs are currently ignored. Defaults are: a vague Gaussian for `\beta`, `\tau^2_i \sim IG(2,1)`, `\theta_j \sim IG(2,2)`, all subject to change.
`starting`	list: setup for starting values of unknown parameters. `starting$mcmcsd` is the initial standard deviation of proposals. `starting$mcmc_startfrom` is input to the adaptive MCMC and can be used to manually restart MCMC. There is currently limited functionality here and some parameters may be ignored.
`debug`	list: setup for debugging things. Some parts of MCMC can be turned off here.
`indpart`	bool defaults to `FALSE`. If `TRUE`, this computes an independent partition model.

Details

This function targets the following model:

y(s) = x(s)^\top \beta + \Lambda v(s) + \epsilon(s),

where y(s) is a q-dimensional vector of outcomes at spatial location s, x(s) is a p-dimensional vector of covariates with static coefficients \beta, \Lambda is a matrix of factor loadings of size (q, k), v(s) is a k-dimensional vector which collects the realization of independent Gaussian processes v_j \sim spmeshed(0, C_j) for j=1, \dots, k and where C_j(s, s') is a correlation function. s is a coordinate in space (d=2) or space plus time (d=3). The Meshed GP implemented here associates an axis-parallel tessellation of the domain to a cubic directed acyclic graph (mesh).

Value

`coordsdata`	data.frame including the original `n` coordinates plus the `n_g` knot coordinates if the model was run on a forced grid. The additional column `forced_grid` has value 1 if the corresponding coordinate is a knot in the forced grid. See examples.
`savedata`	Available if `settings$saving==TRUE`. Needed for making predictions using `predict()` after MCMC. Note: `NA` values of the output are automatically and more efficiently predicted when running `spmeshed`.
`yhat_mcmc`	list of length `n_samples` whose elements are matrices with `n + n_g` rows and `q` columns. Each matrix in the list is a posterior predictive sample of the latent spatial process. `n_g = 0` if the data grid is being used. Given the possibly large `n`, only the thinned chain is output for `y`.
`v_mcmc`	list of length `n_samples` whose elements are matrices with `n + n_g` rows and `k` columns. Each matrix in the list is a posterior sample of the `k` latent spatial process. `n_g = 0` if the data grid is being used. Given the possibly large `n`, only the thinned chain is output for `v`.
`w_mcmc`	list of length `n_samples` whose elements are matrices with `n + n_g` rows and `q` columns. Each matrix in the list is a posterior sample of `w = \Lambda v`. `n_g = 0` if the data grid is being used. Given the possibly large `n`, only the thinned chain is output for `w`.
`lp_mcmc`	list of length `n_samples` whose elements are matrices with `n + n_g` rows and `q` columns. Each matrix in the list is a posterior sample of the linear predictor `X\beta + \Lambda v`. `n_g = 0` if the data grid is being used. Given the possibly large `n`, only the thinned chain is output for `w`.
`beta_mcmc`	array of size `(p, q, n_thin*n_samples)` with the posterior sample for the static regression coefficients `\beta`. The `j`th column of each matrix (`p` rows and `q` columns) corresponds to the `p` linear effects on the `j`th outcome. The full chain minus burn-in is returned NOT thinned since `p` and `q` are relatively small.
`tausq_mcmc`	matrix of size `(q, n_thin*n_samples)`. Each row corresponds to the full MCMC chain for the nugget `\tau^2_j` of the `j`th outcome in the coregionalization/factor model. The full chain minus burn-in is returned NOT thinned since `q` is relatively small.
`theta_mcmc`	array of size `(h, k, n_thin*n_samples)` with the posterior sample for the correlation function parameters `\theta`. `h` is 2 for spatial data (corresponding to the spatial decay of the exponential covariance (`\phi_i, i=1, \dots, k`), and the variance `\sigma^2_i, i=1, \dots, k`), 4 for spacetime data (corresponding to temporal decay, spatial decay, and separability – these are referred to as `a_i`, `\phi_i`, and `\beta_i, i=1, \dots, k`, in Gneiting (2002), see doi:10.1198/016214502760047113, plus the variance `\sigma^2, i=1, \dots, k`). The full chain minus burn-in is returned NOT thinned since `h` and `k` are relatively small. If `settings$ps=TRUE`, the MCMC output for `\sigma^2_i` (last row of `theta_mcmc`) should be discarded and `\Lambda` used instead.
`lambda_mcmc`	array of size `(q, k, n_thin*n_samples)`. Each matrix (of size `(q,k)`) is a posterior sample for `\Lambda` in the coregionalization/factor model. In univariate models, this is usually called `\sigma`. The full chain minus burn-in is returned NOT thinned since `q` and `k` are relatively small.
`paramsd`	Cholesky factorization of the proposal covariance for adaptive MCMC, after adaptation.
`mcmc`	Total number of MCMC iterations performed.
`mcmc_time`	Time in seconds taken for MCMC (not including preprocessing).

Author(s)

Michele Peruzzi michele.peruzzi@duke.edu

References

Peruzzi, M., Banerjee, S., and Finley, A.O. (2022) Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, 117(538):969-982. doi:10.1080/01621459.2020.1833889

Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021) Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579

Peruzzi, M. and Dunson, D.B. (2022) Spatial meshing for general Bayesian multivariate models. https://arxiv.org/abs/2201.10080

Examples


# toy example with tiny dataset and short MCMC
# on a univariate outcome
library(magrittr)
library(dplyr)
library(ggplot2)
library(meshed)

set.seed(2021)

SS <- 12
n <- SS^2 # total n. locations, including missing ones

coords <- expand.grid(xx <- seq(0,1,length.out=SS), xx) %>% 
  as.matrix()

# generate data
sigmasq <- 2.3
phi <- 6
tausq <- .1
B <- c(-1,.5,1)

CC <- sigmasq * exp(-phi * as.matrix(dist(coords)))
LC <- t(chol(CC))
w <- LC %*% rnorm(n)
p <- length(B)
X <- rnorm(n * p) %>% matrix(ncol=p)
y_full <- X %*% B + w + tausq^.5 * rnorm(n)

set_missing <- rbinom(n, 1, 0.1)

simdata <- data.frame(coords,
                      y_full = y_full,
                      w_latent = w) %>%
  mutate(y_observed = ifelse(set_missing==1, NA, y_full))

# MCMC setup
mcmc_keep <- 500
mcmc_burn <- 100
mcmc_thin <- 2

y <- simdata$y_observed
ybar <- mean(y, na.rm=TRUE)

meshout <- spmeshed(y-ybar, X, coords,
                    axis_partition=c(4,4),
                    n_samples = mcmc_keep, 
                    n_burn = mcmc_burn, 
                    n_thin = mcmc_thin, 
                    prior=list(phi=c(1,15)),
                    verbose = 0,
                    n_threads = 1)

# posterior means
best_post_mean <- meshout$beta_mcmc %>% apply(1:2, mean)
  
# process means
wmesh <- data.frame(w_mgp = meshout$w_mcmc %>% summary_list_mean())
# predictions
ymesh <- data.frame(y_mgp = meshout$yhat_mcmc %>% summary_list_mean())

outdf <- 
  meshout$coordsdata %>% 
  cbind(ymesh, wmesh) 

# plot predictions
pred_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=y_mgp)) +
  geom_point() +
  scale_color_viridis_c()

# plot latent process
latent_plot <- outdf %>% 
  ggplot(aes(Var1, Var2, color=w_mgp)) +
  geom_point() + 
  scale_color_viridis_c()

# estimation of regression coefficients
plot(density(meshout$beta_mcmc[1,1,]))
abline(v=B[1], col="red")

[Package meshed version 0.2.3 Index]