R: fit a parallel partitioned GLS

MC_GLSpart {remotePARTS}

R Documentation

fit a parallel partitioned GLS

Description

Fit a GLS model to a large data set by partitioning the data into smaller pieces (partitions), processing these pieces individually, and summarizing output across partitions to conduct hypothesis tests.

Usage

MC_GLSpart(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  ncores = parallel::detectCores(logical = FALSE) - 1,
  debug = FALSE,
  ...
)

MCGLS_partsummary(
  MCpartGLS,
  covar.pars = c(range = 0.1),
  save.GLS = FALSE,
  partsize
)

multicore_fitGLS_partition(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  ncores = parallel::detectCores(logical = FALSE) - 1,
  do.t.test = TRUE,
  do.chisqr.test = TRUE,
  debug = FALSE,
  ...
)

fitGLS_partition(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  do.t.test = TRUE,
  do.chisqr.test = TRUE,
  progressbar = TRUE,
  debug = FALSE,
  ncores = NA,
  parallel = TRUE,
  ...
)

part_data(index, formula, data, formula0 = NULL, coord.names = c("lng", "lat"))

part_csv(index, formula, file, formula0 = NULL, coord.names = c("lng", "lat"))

Arguments

`formula`	a formula for the GLS model
`partmat`	a numeric partition matrix, with values containing indices of locations.
`formula0`	an optional formula for the null GLS model
`part_FUN`	a function to partition individual data. See details for more information about requirements for this function.
`distm_FUN`	a function to calculate distances from a coordinate matrix
`covar_FUN`	a function to calculate covariances from a distance matrix
`covar.pars`	a named list of parameters passed to `covar_FUN`
`nugget`	a numeric fixed nugget component: if NA, the nugget is estimated for each partition
`ncross`	an integer indicating the number of partitions used to calculate cross-partition statistics
`save.GLS`	logical: should full GLS output be saved for each partition?
`ncores`	an optional integer indicating how many CPU threads to use for calculations.
`debug`	logical debug mode
`...`	arguments passed to `part_FUN`
`MCpartGLS`	object resulting from MC_GLSpart()
`partsize`	number of locations per partition
`do.t.test`	logical: should a t-test of the GLS coefficients be conducted?
`do.chisqr.test`	logical: should a correlated chi-squared test of the model fit be conducted?
`progressbar`	logical: should progress be tracked with a progress bar?
`parallel`	logical: should all calculations be done in parallel? See details for more information
`index`	a vector of pixels with which to subset the data
`data`	a data frame
`coord.names`	a vector containing names of spatial coordinate variables (x and y, respectively)
`file`	a text string indicating the csv file from which to read data

Details

The function specified by part_FUN is called internally to obtain properly formatted subsets of the full data (i.e., partitions). Two functions are provided in the remotePARTS package for this purpose: part_data and part_csv. Both of these functions have required arguments that must be specified through the call to fitGLS_partition (via ...). Check each function's argument list and see "part_FUN details" below for more information.

partmat is used to partition the data. partmat must be a complete matrix, without any missing or non-finite values. Each column of partmat is passed as the first argument of part_FUN to obtain the data for one partition, which are then passed to fitGLS. Users are encouraged to use sample_partitions() to obtain a valid partmat.
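As a rough illustration of what a valid partmat looks like, the base-R sketch below builds one by hand and checks it. The helper name make_partmat is hypothetical and is not part of remotePARTS; sample_partitions() remains the recommended route. The sketch assumes each column holds the row indices of one partition.

```r
# Hypothetical helper (not part of remotePARTS): build a partition matrix
# in which each column holds the row indices of one partition.
make_partmat <- function(npix, npart, partsize) {
  stopifnot(npart * partsize <= npix)
  # sample without replacement so that no pixel appears in two partitions
  idx <- sample.int(npix, size = npart * partsize)
  matrix(idx, nrow = partsize, ncol = npart)
}

set.seed(1)
pm_manual <- make_partmat(npix = 1000, npart = 3, partsize = 200)

# partmat must be complete: no missing or non-finite values
stopifnot(is.numeric(pm_manual), all(is.finite(pm_manual)))
dim(pm_manual)  # partsize rows, npart columns
```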

The specific dimensions of partmat can have a substantial effect on the efficiency of fitGLS_partition. For most systems, we do not recommend fitting with partitions exceeding 3000 locations or pixels (i.e., sample_partitions(partsize = 3000, ...)). Any larger, and the covariance matrix inversions may become quite slow (or impossible for some machines). It may help performance to use even smaller partitions of around 1000-2000 locations.
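To get a feel for why partition size matters, the sketch below estimates the memory needed for a single dense covariance matrix. It is a back-of-the-envelope calculation only: it assumes double-precision storage (8 bytes per element) and ignores the working copies made during inversion, whose cost also grows roughly with the cube of the partition size. The function name covmat_megabytes is hypothetical.

```r
# Approximate memory for one dense n x n covariance matrix of doubles
# (8 bytes per element), ignoring any working copies made during inversion.
covmat_megabytes <- function(n) n^2 * 8 / 1e6

partsizes <- c(1000, 2000, 3000, 10000)
data.frame(partsize = partsizes, approx_MB = covmat_megabytes(partsizes))
```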

ncross determines how many partitions are used to estimate cross-partition statistics. All partitions, up to ncross, are compared with all others in a pairwise fashion. There is no hard rule for setting ncross. More crosses will ensure convergence, but we believe that the default of 6 (15 total comparisons) should be sufficient for most moderate-sized maps if 1500-3000 pixel partitions are used. This may require testing with each individual dataset to determine at what point convergence occurs.
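The number of comparisons implied by a given ncross can be checked directly. The sketch below assumes that "pairwise" means every unordered pair among the first ncross partitions is crossed exactly once; n_cross_pairs is a hypothetical helper, not a remotePARTS function.

```r
# Number of unordered pairs among the first `ncross` partitions,
# assuming every such pair is crossed exactly once.
n_cross_pairs <- function(ncross) choose(ncross, 2)

# Enumerate the pairs for a small case: each column is one (i, j) cross
combn(4, 2)

# How the number of comparisons grows with ncross
data.frame(ncross = 2:8, comparisons = n_cross_pairs(2:8))
```

Because the count grows quadratically, modest increases in ncross add many extra cross-partition calculations.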

Covariance matrices for each partition are calculated with covar_FUN from distances among points within the partition. Parameter values for covar_FUN are given by covar.pars.
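A minimal sketch of how a covariance function and covar.pars fit together is given below. The function my_covar_exp is a hypothetical stand-in written for illustration; it assumes that a covar_FUN receives the distance matrix as its first argument and its parameters by name, and it is not claimed to match the package's covar_exp exactly.

```r
# Hypothetical stand-in for a covar_FUN: exponential decay with distance.
# Assumes the distance matrix comes first and parameters are passed by name.
my_covar_exp <- function(d, range) exp(-d / range)

# Toy distance matrix for three points on a line (distances already scaled)
d <- as.matrix(dist(c(0, 0.1, 0.5)))

# Apply the parameters stored in covar.pars by name
covar.pars <- c(range = 0.1)
V <- do.call(my_covar_exp, c(list(d), as.list(covar.pars)))
V
```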

The distances among points are calculated with distm_FUN. distm_FUN can be any function, modeled after geosphere::distm(), that satisfies both: 1) returns a distance matrix among points when a single coordinate matrix is given as first argument; and 2) returns a matrix containing distances between two coordinate matrices if given as the first and second arguments.
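The two requirements on distm_FUN can be illustrated with a small base-R function. distm_euclid below is a hypothetical example that uses planar Euclidean distance, so it is only a sketch of the interface and would not be appropriate for unprojected longitude/latitude data.

```r
# Hypothetical distm_FUN: planar Euclidean distance (no geodesic correction).
# With one coordinate matrix it returns all pairwise distances among its rows;
# with two it returns distances from rows of `x` to rows of `y`.
distm_euclid <- function(x, y = x) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  dx <- outer(x[, 1], y[, 1], "-")
  dy <- outer(x[, 2], y[, 2], "-")
  sqrt(dx^2 + dy^2)
}

a <- cbind(lng = c(0, 3), lat = c(0, 4))
b <- cbind(lng = c(0, 6, 3), lat = c(0, 8, 0))

distm_euclid(a)     # 2 x 2, symmetric, zero diagonal
distm_euclid(a, b)  # 2 x 3, rows index `a`, columns index `b`
```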

If nugget = NA, a maximum-likelihood (ML) nugget is estimated separately for each partition. Otherwise, the supplied value is used as a fixed nugget for all partitions.
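For intuition about what a nugget does to a covariance matrix, the sketch below applies one common convention in which the nugget is the share of variance that is spatially uncorrelated. This convention is an assumption made for illustration and is not confirmed by this page; add_nugget is a hypothetical helper.

```r
# Assumed convention (not confirmed by this page): the nugget is the share of
# variance that is spatially uncorrelated, mixed in as
#   V_nugget = (1 - nugget) * V + nugget * I
add_nugget <- function(V, nugget) {
  stopifnot(nugget >= 0, nugget <= 1)
  (1 - nugget) * V + nugget * diag(nrow(V))
}

V <- matrix(c(1, 0.6, 0.6, 1), nrow = 2)
add_nugget(V, nugget = 0)     # unchanged
add_nugget(V, nugget = 0.25)  # off-diagonal shrinks, diagonal stays 1
```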

It is not required to use all partitions for cross-partition calculations, nor is it recommended to do so for most large data sets.

If progressbar = TRUE, a text progress bar shows the current status of the calculations in the console.

Value

a "MC_partGLS", which is a precursor to a "partGLS" object

a "partGLS" object

"partGLS" object

fitGLS_partition returns a list object of class "partGLS" which contains at least the following elements:

call: the function call
GLS: an optional list of "remoteGLS" objects, one for each partition
part: statistics calculated from each partition: see below for further details
cross: statistics calculated from each pair of crossed partitions, determined by ncross: see below for further details
overall: summary statistics of the overall model: see below for further details

part is a sub-list containing the following elements

coefficients: a numeric matrix of GLS coefficients for each partition
SEs: a numeric matrix of coefficient standard errors
tstats: a numeric matrix of coefficient t-statistics
pvals_t: a numeric matrix of t-test pvalues
nuggets: a numeric vector of nuggets for each partition
covar.pars: covar.pars input vector
modstats: a numeric matrix with rows corresponding to partitions and columns corresponding to log-likelihoods (logLik), sum of square error (SSE), mean-squared error (MSE), regression mean-square (MSR), F-statistics (Fstat), and p-values from F-tests (pval_F)

cross is a sub-list containing the following elements, which are used to calculate the combined (across partitions) standard errors of the coefficient estimates and statistical tests. See Ives et al. (2022).

rcoefs: a numeric matrix of cross-partition correlations in the estimates of the coefficients
rSSRs: a numeric vector of cross-partition correlations in the regression sum of squares
rSSEs: a numeric vector of cross-partition correlations in the sum of squared errors

and overall is a sub-list containing the elements

coefficients: a numeric vector of the average coefficient estimates across all partitions
rcoefficients: a numeric vector of the average cross-partition correlations in the coefficient estimates, averaged across all crosses
rSSR: the average cross-partition correlation in the regression sum of squares
rSSE: the average cross-partition correlation in the sum of squared errors
Fstat: the average F-statistic across partitions
dfs: degrees of freedom to be used with partitioned GLS F-test
partdims: dimensions of partmat
pval.chisqr: if do.chisqr.test = TRUE, a p-value for the correlated chi-squared test
t.test: if do.t.test = TRUE, a table with t-test results, including the coefficient estimates, standard errors, t-statistics, and p-values

part_data and part_csv both return a list with two elements:

data: a dataframe, containing the data subset
coords: a coordinate matrix for the subset

parallel implementation

In order to be efficient and account for different user situations, parallel processing is available natively in fitGLS_partition. There are a few different specifications that will result in different behavior:

When parallel = TRUE and ncores > 1, all calculations are done completely in parallel (via multicore_fitGLS_partition()). In this case, parallelization is implemented with the parallel, doParallel, and foreach packages. In this version, all matrix operations are serialized on each worker, but multiple operations can occur simultaneously.

When parallel = FALSE and ncores > 1, most calculations are done on a single core but matrix operations use multiple cores. In this case, ncores is passed to fitGLS. With this option, it is suggested not to exceed the number of physical cores (not threads).

When ncores <= 1, the calculations are completely serialized.

When ncores = NA (the default), only one core is used.

In the parallel implementation of this function, a progress bar is not possible, so progressbar is ignored.
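The rules above can be restated as a small decision function. describe_mode is a hypothetical helper written only to summarize this section; it does not call or inspect remotePARTS. The second half shows one cautious way to pick ncores with parallel::detectCores(), which can return NA on some platforms.

```r
# Hypothetical helper (not part of remotePARTS) restating the rules above
describe_mode <- function(parallel = TRUE, ncores = NA) {
  if (is.na(ncores) || ncores <= 1) {
    "serial: one core"
  } else if (parallel) {
    "fully parallel: partitions spread across workers"
  } else {
    "single process: matrix operations use multiple cores"
  }
}

describe_mode(parallel = TRUE,  ncores = NA)  # the defaults
describe_mode(parallel = TRUE,  ncores = 4)
describe_mode(parallel = FALSE, ncores = 4)

# One way to choose ncores: physical cores minus one, never below 1.
# detectCores() can return NA on some platforms, hence the fallback.
phys <- parallel::detectCores(logical = FALSE)
ncores <- if (is.na(phys)) 1L else max(1L, phys - 1L)
```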

`part_FUN` details

part_FUN can be any function that satisfies the following criteria:

1. the first argument of part_FUN must accept an index of pixels by which to subset the data;

2. part_FUN must also accept formula and formula0 from fitGLS_partition; and

3. the output of part_FUN must be a list with at least the following elements, which are passed to fitGLS:

data: a data frame containing all variables given by formula. Rows should correspond to pixels specified by the first argument
coords: a coordinate matrix or data frame. Rows should correspond to pixels specified by the first argument

Two functions that satisfy these criteria are provided by remotePARTS: part_data and part_csv.

part_data uses an in-memory data frame (data) as a data source. part_csv instead reads data from a csv file (file), one partition at a time, for efficient memory usage. part_csv internally calls sqldf::read.csv.sql() for fast and efficient row extraction.

Both functions use index to subset rows of data and formula and formula0 (optional) to determine which variables to select.

Both functions also use coord.names to indicate which variables contain spatial coordinates. The name of the x-coordinate column should always precede the y-coordinate column: c("x", "y").

Users are encouraged to write their own part_FUN functions to meet their needs. For example, one might be interested in using data stored in a raster stack or any other file type. In this case, a user-defined part_FUN function allows access to fitGLS_partition without saving reformatted copies of data.
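A user-defined part_FUN that meets the three criteria might look like the sketch below. part_xy is a hypothetical function written for illustration, and the toy data frame and its column names (x, y, resp, pred) are invented. It relies on all.vars(), so it assumes the formulas name their variables explicitly rather than using ".".

```r
# Hypothetical part_FUN (not part of remotePARTS) for an in-memory data frame
# whose coordinate columns are named "x" and "y".
part_xy <- function(index, formula, data, formula0 = NULL,
                    coord.names = c("x", "y")) {
  # variables needed by the full model and, if supplied, the null model
  vars <- all.vars(formula)
  if (!is.null(formula0)) vars <- union(vars, all.vars(formula0))
  list(
    data   = data[index, vars, drop = FALSE],
    coords = as.matrix(data[index, coord.names, drop = FALSE])
  )
}

# Toy data, for illustration only
toy <- data.frame(x = runif(10), y = runif(10),
                  resp = rnorm(10), pred = rnorm(10))
part <- part_xy(index = c(2, 5, 7), formula = resp ~ pred, data = toy)
str(part)
```

Any extra arguments such a function needs (here, data) would be supplied through ... in the call to fitGLS_partition, as described in Details.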

References

Ives, A. R., L. Zhu, F. Wang, J. Zhu, C. J. Morrow, and V. C. Radeloff. in review. Statistical tests for non-independent partitions of large autocorrelated datasets. MethodsX.

Examples

## read data
data(ndvi_AK10000)
df = ndvi_AK10000[seq_len(1000), ] # first 1000 rows

## create partition matrix
pm = sample_partitions(nrow(df), npart = 3)

## fit GLS with fixed nugget
partGLS = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
                           data = df, nugget = 0, do.t.test = TRUE)

## hypothesis tests
chisqr(partGLS) # explanatory power of model
t.test(partGLS) # significance of predictors

## now with a numeric predictor
fitGLS_partition(formula = CLS_coef ~ lat, partmat = pm, data = df, nugget = 0)


## fit ML nugget for each partition (slow)
(partGLS.opt = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
                                data = df, nugget = NA))
partGLS.opt$part$nuggets # ML nuggets

# Certain model structures may not be useful:
## zero intercept with a numeric predictor: produces NAs and a warning in the statistical tests
fitGLS_partition(formula = CLS_coef ~ 0 + lat, partmat = pm, data = df, nugget = 0)

## intercept-only, gives warning
fitGLS_partition(formula = CLS_coef ~ 1, partmat = pm, data = df, nugget = 0,
                 do.chisqr.test = FALSE)

## part_data examples
part_data(1:20, CLS_coef ~ 0 + land, data = ndvi_AK10000)


## part_csv examples
## CAUTION: the examples for part_csv() have side effects (they write a file to disk)
# first, create a .csv file from ndvi_AK10000
data(ndvi_AK10000)
file.path = file.path(tempdir(), "ndviAK10000-remotePARTS.csv")
write.csv(ndvi_AK10000, file = file.path)

# build a partition from the first 20 pixels in the file
part_csv(1:20, formula = CLS_coef ~ 0 + land, file = file.path)

# now with a random 20 pixels
part_csv(sample(3000, 20), formula = CLS_coef ~ 0 + land, file = file.path)

# remove the example csv file from disk
file.remove(file.path)

[Package remotePARTS version 1.0.4 Index]