spmeshed {meshed} | R Documentation |
Posterior sampling for models based on MGPs
Description
Fits Bayesian multivariate spatial or spatiotemporal regression models with latent MGPs via Markov chain Monte Carlo.
Usage
spmeshed(y, x, coords, k=NULL,
family = "gaussian",
axis_partition = NULL,
block_size = 30,
grid_size = NULL,
grid_custom = NULL,
n_samples = 1000,
n_burnin = 100,
n_thin = 1,
n_threads = 4,
verbose = 0,
predict_everywhere = FALSE,
settings = list(adapting=TRUE, forced_grid=NULL,
cache=NULL, ps=TRUE, saving=TRUE, low_mem=FALSE, hmc=4),
prior = list(beta=NULL, tausq=NULL, sigmasq = NULL,
phi=NULL, a=NULL, nu = NULL,
toplim = NULL, btmlim = NULL, set_unif_bounds=NULL),
starting = list(beta=NULL, tausq=NULL, theta=NULL, lambda=NULL, v=NULL,
a=NULL, nu = NULL,
mcmcsd=.05,
mcmc_startfrom=0),
debug = list(sample_beta=TRUE, sample_tausq=TRUE,
sample_theta=TRUE, sample_w=TRUE, sample_lambda=TRUE,
verbose=FALSE, debug=FALSE),
indpart=FALSE
)
Arguments
y |
matrix of multivariate outcomes with n rows and q columns. Each row of y corresponds to a row of coords . NA values are accepted in any combination and will be predicted via MCMC.
|
x |
matrix of covariates with n rows and p columns.
|
coords |
matrix of coordinates with n rows and d=2 or d=3 columns for spatial or spacetime regression, respectively.
|
k |
integer k≤q , number of latent processes to use for the linear model of coregionalization. If unspecified, this is set to q =ncol(y) .
|
family |
a vector with length 1 or q whose elements corresponds to the data types of columns of y . Available choices are gaussian , poisson , binomial , beta for outcomes that are continuous, count, binary, or (0,1) proportions.
|
axis_partition |
integer vector of size d : number of intervals each coordinate axis is split into
|
block_size |
integer approximate size of the blocks after domain partitioning. Only used if axis_partition is not specified.
|
grid_size |
integer vector of size d : number of 'knots' of the reference grid along each axis.
This grid is then partitioned using either axis_partition or block_size .
If unspecified, this is set so that the eventual grid size is close to n .
This parameter is ignored if settings$forced_grid=FALSE in which case the data are assumed to be on a grid.
|
grid_custom |
list with elements grid and axis_interval_partition . grid is a data.frame with the user supplied grid of knots. It is possible to include covariate values for the grid locations as additional columns, as long as their number matches ncol(x) - this is useful to make raster images of predictions. axis_interval_partition is the user supplied set of cuts for each coordinate axis (Note: these are the actual cutpoints along the axes, not the number of cuts). If left empty, axis_partition will be used to partition the custom grid. No checks are made on the validity of this grid. This parameter is ignored if settings$forced_grid=FALSE in which case the data are assumed to be on a grid.
|
n_samples |
integer number of MCMC samples at which all the unknowns are stored (including the latent effects).
|
n_burnin |
integer number of MCMC samples to discard at the beginning of the chain.
|
n_thin |
integer thinning parameter for the MCMC chain. Only the chain of latent effects (w ) is thinned to save memory in big data problems. Chains for other unknowns are not thinned and thus will be of length n_thin * n_samples .
|
n_threads |
integer number of OpenMP threads. This is ineffective if meshed was not compiled with OpenMP support.
|
verbose |
integer. If verbose<=20 , then this is the number of times a message is displayed during MCMC. If verbose>20 , then this is the number of MCMC iterations to wait until the next message update. If verbose=Inf , then a message will be printed at each MCMC iteration.
|
predict_everywhere |
bool used if settings$forced_grid=T. Should predictions be made at the reference grid locations? If not, predictions will be made only at the supplied NA values of Y.
|
settings |
list: settings$adapting turns the adaptation of MCMC on/off, settings$forced_grid determines whether or not to use the data grid or a forced grid; if unspecified, the function will try to see what the data look like. Note: if forced_grid=FALSE and n is very large and coords are irregularly spaced, then expect slowdowns in preprocessing and consider using forced_grid=TRUE instead. settings$saving will save model data if set to TRUE . settings$low_mem will only save beta_mcmc , lambda_mcmc , v_mcmc , tausq_mcmc (and not w_mcmc and lp_mcmc , which can be recovered from the others), thereby using less memory. All fitted predictions remain available in yhat_mcmc for convenience. settings$ps (default TRUE ) determines whether to use the PS parametrization (Peruzzi et al 2021). settings$hmc , used if any outcome is not Gaussian, (1: MALA, 2: NUTS, 3: RM-MALA, 4: Simplified manifold preconditioning (default))
|
prior |
list: setup for priors of unknown parameters. prior$phi needs to be specified as the support of the Uniform prior for ϕ . There is currently limited functionality here and some inputs are currently ignored. Defaults are: a vague Gaussian for β , τi2∼IG(2,1) , θj∼IG(2,2) , all subject to change.
|
starting |
list: setup for starting values of unknown parameters. starting$mcmcsd is the initial standard deviation of proposals. starting$mcmc_startfrom is input to the adaptive MCMC and can be used to manually restart MCMC. There is currently limited functionality here and some parameters may be ignored.
|
debug |
list: setup for debugging things. Some parts of MCMC can be turned off here.
|
indpart |
bool defaults to FALSE . If TRUE , this computes an independent partition model.
|
Details
This function targets the following model:
y(s)=x(s)⊤β+Λv(s)+ϵ(s),
where y(s)
is a q
-dimensional vector of outcomes at spatial location s
, x(s)
is a p
-dimensional vector of covariates with static coefficients β
, Λ
is a matrix of factor loadings of size (q,k)
, v(s)
is a k
-dimensional vector which collects the realization of independent Gaussian processes vj∼spmeshed(0,Cj)
for j=1,…,k
and where Cj(s,s′)
is a correlation function. s
is a coordinate in space (d=2
) or space plus time (d=3
). The Meshed GP implemented here associates an axis-parallel tessellation of the domain to a cubic directed acyclic graph (mesh).
Value
coordsdata |
data.frame including the original n coordinates plus the ng knot coordinates if the model was run on a forced grid. The additional column forced_grid has value 1 if the corresponding coordinate is a knot in the forced grid. See examples.
|
savedata |
Available if settings$saving==TRUE . Needed for making predictions using predict() after MCMC. Note: NA values of the output are automatically and more efficiently predicted when running spmeshed .
|
yhat_mcmc |
list of length n_samples whose elements are matrices with n+ng rows and q columns. Each matrix in the list is a posterior predictive sample of the latent spatial process. ng=0 if the data grid is being used. Given the possibly large n , only the thinned chain is output for y .
|
v_mcmc |
list of length n_samples whose elements are matrices with n+ng rows and k columns. Each matrix in the list is a posterior sample of the k latent spatial process. ng=0 if the data grid is being used. Given the possibly large n , only the thinned chain is output for v .
|
w_mcmc |
list of length n_samples whose elements are matrices with n+ng rows and q columns. Each matrix in the list is a posterior sample of w=Λv . ng=0 if the data grid is being used. Given the possibly large n , only the thinned chain is output for w .
|
lp_mcmc |
list of length n_samples whose elements are matrices with n+ng rows and q columns. Each matrix in the list is a posterior sample of the linear predictor Xβ+Λv . ng=0 if the data grid is being used. Given the possibly large n , only the thinned chain is output for w .
|
beta_mcmc |
array of size (p, q, n_thin*n_samples) with the posterior sample for the static regression coefficients β . The j th column of each matrix (p rows and q columns) corresponds to the p linear effects on the j th outcome. The full chain minus burn-in is returned NOT thinned since p and q are relatively small.
|
tausq_mcmc |
matrix of size (q, n_thin*n_samples) . Each row corresponds to the full MCMC chain for the nugget τj2 of the j th outcome in the coregionalization/factor model. The full chain minus burn-in is returned NOT thinned since q is relatively small.
|
theta_mcmc |
array of size (h, k, n_thin*n_samples) with the posterior sample for the correlation function parameters θ . h is 2 for spatial data (corresponding to the spatial decay of the exponential covariance (ϕi,i=1,…,k ), and the variance σi2,i=1,…,k ), 4 for spacetime data (corresponding to temporal decay, spatial decay, and separability – these are referred to as ai , ϕi , and βi,i=1,…,k , in Gneiting (2002), see doi:10.1198/016214502760047113, plus the variance σ2,i=1,…,k ). The full chain minus burn-in is returned NOT thinned since h and k are relatively small. If settings$ps=TRUE , the MCMC output for σi2 (last row of theta_mcmc ) should be discarded and Λ used instead.
|
lambda_mcmc |
array of size (q, k, n_thin*n_samples) . Each matrix (of size (q,k) ) is a posterior sample for Λ in the coregionalization/factor model. In univariate models, this is usually called σ . The full chain minus burn-in is returned NOT thinned since q and k are relatively small.
|
paramsd |
Cholesky factorization of the proposal covariance for adaptive MCMC, after adaptation.
|
mcmc |
Total number of MCMC iterations performed.
|
mcmc_time |
Time in seconds taken for MCMC (not including preprocessing).
|
Author(s)
Michele Peruzzi michele.peruzzi@duke.edu
References
Peruzzi, M., Banerjee, S., and Finley, A.O. (2022)
Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains. Journal of the American Statistical Association, 117(538):969-982. doi:10.1080/01621459.2020.1833889
Peruzzi, M., Banerjee, S., Dunson, D.B., and Finley, A.O. (2021)
Grid-Parametrize-Split (GriPS) for Improved Scalable Inference in Spatial Big Data Analysis. https://arxiv.org/abs/2101.03579
Peruzzi, M. and Dunson, D.B. (2022)
Spatial meshing for general Bayesian multivariate models. https://arxiv.org/abs/2201.10080
Examples
# toy example with tiny dataset and short MCMC
# on a univariate outcome
library(magrittr)
library(dplyr)
library(ggplot2)
library(meshed)
set.seed(2021)
SS <- 12
n <- SS^2 # total n. locations, including missing ones
coords <- expand.grid(xx <- seq(0,1,length.out=SS), xx) %>%
as.matrix()
# generate data
sigmasq <- 2.3
phi <- 6
tausq <- .1
B <- c(-1,.5,1)
CC <- sigmasq * exp(-phi * as.matrix(dist(coords)))
LC <- t(chol(CC))
w <- LC %*% rnorm(n)
p <- length(B)
X <- rnorm(n * p) %>% matrix(ncol=p)
y_full <- X %*% B + w + tausq^.5 * rnorm(n)
set_missing <- rbinom(n, 1, 0.1)
simdata <- data.frame(coords,
y_full = y_full,
w_latent = w) %>%
mutate(y_observed = ifelse(set_missing==1, NA, y_full))
# MCMC setup
mcmc_keep <- 500
mcmc_burn <- 100
mcmc_thin <- 2
y <- simdata$y_observed
ybar <- mean(y, na.rm=TRUE)
meshout <- spmeshed(y-ybar, X, coords,
axis_partition=c(4,4),
n_samples = mcmc_keep,
n_burn = mcmc_burn,
n_thin = mcmc_thin,
prior=list(phi=c(1,15)),
verbose = 0,
n_threads = 1)
# posterior means
best_post_mean <- meshout$beta_mcmc %>% apply(1:2, mean)
# process means
wmesh <- data.frame(w_mgp = meshout$w_mcmc %>% summary_list_mean())
# predictions
ymesh <- data.frame(y_mgp = meshout$yhat_mcmc %>% summary_list_mean())
outdf <-
meshout$coordsdata %>%
cbind(ymesh, wmesh)
# plot predictions
pred_plot <- outdf %>%
ggplot(aes(Var1, Var2, color=y_mgp)) +
geom_point() +
scale_color_viridis_c()
# plot latent process
latent_plot <- outdf %>%
ggplot(aes(Var1, Var2, color=w_mgp)) +
geom_point() +
scale_color_viridis_c()
# estimation of regression coefficients
plot(density(meshout$beta_mcmc[1,1,]))
abline(v=B[1], col="red")
[Package
meshed version 0.2.3
Index]