MC_GLSpart {remotePARTS}R Documentation

fit a parallel partitioned GLS

Description

fit a GLS model to a large data set by partitioning the data into smaller pieces (partitions) and processing these pieces individually and summarizing output across partitions to conduct hypothesis tests.

Usage

MC_GLSpart(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  ncores = parallel::detectCores(logical = FALSE) - 1,
  debug = FALSE,
  ...
)

MCGLS_partsummary(
  MCpartGLS,
  covar.pars = c(range = 0.1),
  save.GLS = FALSE,
  partsize
)

multicore_fitGLS_partition(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  ncores = parallel::detectCores(logical = FALSE) - 1,
  do.t.test = TRUE,
  do.chisqr.test = TRUE,
  debug = FALSE,
  ...
)

fitGLS_partition(
  formula,
  partmat,
  formula0 = NULL,
  part_FUN = "part_data",
  distm_FUN = "distm_scaled",
  covar_FUN = "covar_exp",
  covar.pars = c(range = 0.1),
  nugget = NA,
  ncross = 6,
  save.GLS = FALSE,
  do.t.test = TRUE,
  do.chisqr.test = TRUE,
  progressbar = TRUE,
  debug = FALSE,
  ncores = NA,
  parallel = TRUE,
  ...
)

part_data(index, formula, data, formula0 = NULL, coord.names = c("lng", "lat"))

part_csv(index, formula, file, formula0 = NULL, coord.names = c("lng", "lat"))

Arguments

formula

a formula for the GLS model

partmat

a numeric partition matrix, with values containing indices of locations.

formula0

an optional formula for the null GLS model

part_FUN

a function to partition individual data. See details for more information about requirements for this function.

distm_FUN

a function to calculate distances from a coordinate matrix

covar_FUN

a function to calculate covariances from a distance matrix

covar.pars

a named list of parameters passed to covar_FUN

nugget

a numeric fixed nugget component: if NA, the nugget is estimated for each partition

ncross

an integer indicating the number of partitions used to calculate cross-partition statistics

save.GLS

logical: should full GLS output be saved for each partition?

ncores

an optional integer indicating how many CPU threads to use for calculations.

debug

logical debug mode

...

arguments passed to part_FUN

MCpartGLS

object resulting from MC_partGLS()

partsize

number of locations per partition

do.t.test

logical: should a t-test of the GLS coefficients be conducted?

do.chisqr.test

logical: should a correlated chi-squared test of the model fit be conducted?

progressbar

logical: should progress be tracked with a progress bar?

parallel

logical: should all calculations be done in parallel? See details for more information

index

a vector of pixels with which to subset the data

data

a data frame

coord.names

a vector containing names of spatial coordinate variables (x and y, respectively)

file

a text string indicating the csv file from which to read data

Details

The function specified by part_FUN is called internally to obtain properly formatted subsets of the full data (i.e., partitions). Two functions are provided in the remotePARTs package for this purpose: part_data and part_csv. Both of these functions have required arguments that must be specified through the call to fitGLS_partition (via ...). Check each function's argument list and see "part_FUN details" below for more information.

partmat is used to partition the data. partmat must be a complete matrix, without any missing or non-finite values. Columns of partmat are passed as the first argument part_FUN to obtain data, which is then passed to fitGLS. Users are encouraged to use sample_partitions() to obtain a valid partmat.

The specific dimensions of partmat can have a substantial effect on the efficiency of fitGLS_partition. For most systems, we do not recommend fitting with partitions exceeding 3000 locations or pixels (i.e., partmat(partsize = 3000, ...)). Any larger, and the covariance matrix inversions may become quite slow (or impossible for some machines). It may help performance to use smaller even partitions of around 1000-2000 locations.

ncross determines how many partitions are used to estimate cross-partition statistics. All partitions, up to ncross are compared with all others in a pairwise fashion. There is no hard rule for setting mincross. More crosses will ensure convergence, but we believe that the default of 6 (10 total comparisons) should be sufficient for most moderate-sized maps if 1500-3000 pixel partitions are used. This may require testing with each individual dataset to determine at what point convergence occurs.

Covariance matrices for each partition are calculated with covar_FUN from distances among points within the partition. Parameter values for covar_FUN are given by covar.pars.

The distances among points are calculated with distm_FUN. distm_FUN can be any function, modeled after geosphere::distm(), that satisfies both: 1) returns a distance matrix among points when a single coordinate matrix is given as first argument; and 2) returns a matrix containing distances between two coordinate matrices if given as the first and second arguments.

If nugget = NA, a ML nugget is obtained for each partition. Otherwise, a fixed nugget is used for all partitions.

It is not required to use all partitions for cross-partition calculations, nor is it recommended to do so for most large data sets.

If progressbar = TRUE a text progress bar shows the current status of the calculations in the console.

Value

a "MC_partGLS", which is a precursor to a "partGLS" object

a "partGLS" object

"partGLS" object

fitGLS_partition returns a list object of class "partGLS" which contains at least the following elements:

call

the function call

GLS

an optional list of "remoteGLS" objects, one for each partition

part

statistics calculated from each partition: see below for further details

cross

statistics calculated from each pair of crossed partitions, determined by ncross: see below for further details

overall

summary statistics of the overall model: see below for further details

part is a sub-list containing the following elements

coefficients

a numeric matrix of GLS coefficients for each partition

SEs

a numeric matrix of coefficient standard errors

tstats

a numeric matrix of coefficient t-statstitics

pvals_t

a numeric matrix of t-test pvalues

nuggets

a numeric vector of nuggets for each partition

covar.pars

covar.pars input vector

modstats

a numeric matrix with rows corresponding to partitions and columns corresponding to log-likelihoods (logLik), sum of square error (SSE), mean-squared error (MSE), regression mean-square (MSR), F-statistics (Fstat), and p-values from F-tests (pval_F)

cross is a sub-list containing the following elements, which are use to calculate the combined (across partitions) standard errors of the coefficient estimates and statistical tests. See Ives et al. (2022).

rcoefs

a numeric matrix of cross-partition correlations in the estimates of the coefficients

rSSRs

a numeric vector of cross-partition correlations in the regression sum of squares

rSSEs

a numeric vector of cross-partition correlations in the sum of squared errors

and overall is a sub-list containing the elements

coefficients

a numeric vector of the average coefficient estimates across all partitions

rcoefficients

a numeric vector of the average cross-partition coefficient from across all crosses

rSSR

the average cross-partition correlation in the regression sum of squares

rSSE

the average cross-partition correlation in the sum of squared errors

Fstat

the average f-statistic across partitions

dfs

degrees of freedom to be used with partitioned GLS f-test

partdims

dimensions of partmat

pval.chisqr

if chisqr.test = TRUE, a p-value for the correlated chi-squared test

t.test

if do.t.test = TRUE, a table with t-test results, including the coefficient estimates, standard errors, t-statistics, and p-values

part_data and part_csv both return a list with two elements:

data

a dataframe, containing the data subset

coords

a coordinate matrix for the subset

parallel implementation

In order to be efficient and account for different user situations, parallel processing is available natively in fitGLS_partition. There are a few different specifications that will result in different behavior:

When parallel = TRUE and ncores > 1, all calculations are done completely in parallel (via multicore_fitGLS_partition()). In this case, parallelization is implemented with the parallel, doParallel, and foreach packages. In this version, all matrix operations are serialized on each worker but multiple operations can occur simultaneously..

When parallel = FALSE and ncores > 1, then most calculations are done on a single core but matrix opperations use multiple cores. In this case, ncores is passed to fitGLS. In this option, it is suggested to not exceed the number of physical cores (not threads).

When ncores <= 1, then the calculations are completely serialized

When ncores = NA (the default), only one core is used.

In the parallel implementation of this function, a progress bar is not possible, so progressbar is ignored.

part_FUN details

part_FUN can be any function that satisfies the following criteria

1. the first argument of part_FUN must accept an index of pixels by which to subset the data;

2. part_FUN must also accept formula and formula0 from fitGLS_partition; and

3. the output of part_FUN must be a list with at least the following elements, which are passed to fitGLS;

data

a data frame containing all variables given by formula. Rows should correspond to pixels specified by the first argument

coords

a coordinate matrix or data frame. Rows should correspond to pixels specified by the first argument

Two functions that satisfy these criteria are provided by remotePARTS: part_data and part_csv.

part_data uses an in-memory data frame (data) as a data source. part_csv, instead reads data from a csv file (file), one partition at a time, for efficient memory usage. part_csv internally calls sqldf::read.csv.sql() for fast and efficient row extraction.

Both functions use index to subset rows of data and formula and formula0 (optional) to determine which variables to select.

Both functions also use coord.names to indicate which variables contain spatial coordinates. The name of the x-coordinate column should always preceed the y-coordinate column: c("x", "y").

Users are encouraged to write their own part_FUN functions to meet their needs. For example, one might be interested in using data stored in a raster stack or any other file type. In this case, a user-defined part_FUN function allows access to fitGLS_partition without saving reformatted copies of data.

References

Ives, A. R., L. Zhu, F. Wang, J. Zhu, C. J. Morrow, and V. C. Radeloff. in review. Statistical tests for non-independent partitions of large autocorrelated datasets. MethodsX.

See Also

Other partitionedGLS: crosspart_GLS(), sample_partitions()

Other partitionedGLS: crosspart_GLS(), sample_partitions()

Other partitionedGLS: crosspart_GLS(), sample_partitions()

Examples

## read data
data(ndvi_AK10000)
df = ndvi_AK10000[seq_len(1000), ] # first 1000 rows

## create partition matrix
pm = sample_partitions(nrow(df), npart = 3)

## fit GLS with fixed nugget
partGLS = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
                           data = df, nugget = 0, do.t.test = TRUE)

## hypothesis tests
chisqr(partGLS) # explanatory power of model
t.test(partGLS) # significance of predictors

## now with a numeric predictor
fitGLS_partition(formula = CLS_coef ~ lat, partmat = pm, data = df, nugget = 0)


## fit ML nugget for each partition (slow)
(partGLS.opt = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
                                data = df, nugget = NA))
partGLS.opt$part$nuggets # ML nuggets

# Certain model structures may not be useful:
## 0 intercept with numeric predictor (produces NAs) and gives a warning in statistical tests
fitGLS_partition(formula = CLS_coef ~ 0 + lat, partmat = pm, data = df, nugget = 0)

## intercept-only, gives warning
fitGLS_partition(formula = CLS_coef ~ 1, partmat = pm, data = df, nugget = 0,
                 do.chisqr.test = FALSE)

## part_data examples
part_data(1:20, CLS_coef ~ 0 + land, data = ndvi_AK10000)


## part_csv examples - ## CAUTION: examples for part_csv() include manipulation side-effects:
# first, create a .csv file from ndviAK
data(ndvi_AK10000)
file.path = file.path(tempdir(), "ndviAK10000-remotePARTS.csv")
write.csv(ndvi_AK10000, file = file.path)

# build a partition from the first 30 pixels in the file
part_csv(1:20, formula = CLS_coef ~ 0 + land, file = file.path)

# now with a random 20 pixels
part_csv(sample(3000, 20), formula = CLS_coef ~ 0 + land, file = file.path)

# remove the example csv file from disk
file.remove(file.path)



[Package remotePARTS version 1.0.4 Index]