MC_GLSpart {remotePARTS} | R Documentation |
fit a parallel partitioned GLS
Description
Fit a GLS model to a large data set by partitioning the data into smaller pieces (partitions), processing each partition individually, and summarizing output across partitions to conduct hypothesis tests.
Usage
MC_GLSpart(
formula,
partmat,
formula0 = NULL,
part_FUN = "part_data",
distm_FUN = "distm_scaled",
covar_FUN = "covar_exp",
covar.pars = c(range = 0.1),
nugget = NA,
ncross = 6,
save.GLS = FALSE,
ncores = parallel::detectCores(logical = FALSE) - 1,
debug = FALSE,
...
)
MCGLS_partsummary(
MCpartGLS,
covar.pars = c(range = 0.1),
save.GLS = FALSE,
partsize
)
multicore_fitGLS_partition(
formula,
partmat,
formula0 = NULL,
part_FUN = "part_data",
distm_FUN = "distm_scaled",
covar_FUN = "covar_exp",
covar.pars = c(range = 0.1),
nugget = NA,
ncross = 6,
save.GLS = FALSE,
ncores = parallel::detectCores(logical = FALSE) - 1,
do.t.test = TRUE,
do.chisqr.test = TRUE,
debug = FALSE,
...
)
fitGLS_partition(
formula,
partmat,
formula0 = NULL,
part_FUN = "part_data",
distm_FUN = "distm_scaled",
covar_FUN = "covar_exp",
covar.pars = c(range = 0.1),
nugget = NA,
ncross = 6,
save.GLS = FALSE,
do.t.test = TRUE,
do.chisqr.test = TRUE,
progressbar = TRUE,
debug = FALSE,
ncores = NA,
parallel = TRUE,
...
)
part_data(index, formula, data, formula0 = NULL, coord.names = c("lng", "lat"))
part_csv(index, formula, file, formula0 = NULL, coord.names = c("lng", "lat"))
Arguments
formula |
a formula for the GLS model |
partmat |
a numeric partition matrix, with values containing indices of locations. |
formula0 |
an optional formula for the null GLS model |
part_FUN |
a function to partition individual data. See details for more information about requirements for this function. |
distm_FUN |
a function to calculate distances from a coordinate matrix |
covar_FUN |
a function to calculate covariances from a distance matrix |
covar.pars |
a named list of parameters passed to covar_FUN |
nugget |
a numeric fixed nugget component: if NA, the nugget is estimated for each partition |
ncross |
an integer indicating the number of partitions used to calculate cross-partition statistics |
save.GLS |
logical: should full GLS output be saved for each partition? |
ncores |
an optional integer indicating how many CPU threads to use for calculations. |
debug |
logical debug mode |
... |
arguments passed to part_FUN |
MCpartGLS |
object resulting from MC_GLSpart() |
partsize |
number of locations per partition |
do.t.test |
logical: should a t-test of the GLS coefficients be conducted? |
do.chisqr.test |
logical: should a correlated chi-squared test of the model fit be conducted? |
progressbar |
logical: should progress be tracked with a progress bar? |
parallel |
logical: should all calculations be done in parallel? See details for more information |
index |
a vector of pixels with which to subset the data |
data |
a data frame |
coord.names |
a vector containing names of spatial coordinate variables (x and y, respectively) |
file |
a text string indicating the csv file from which to read data |
Details
The function specified by part_FUN is called internally to obtain properly
formatted subsets of the full data (i.e., partitions). Two functions are
provided in the remotePARTS package for this purpose: part_data and part_csv.
Both of these functions have required arguments that must be specified through
the call to fitGLS_partition (via ...). Check each function's argument list and
see "part_FUN details" below for more information.
partmat is used to partition the data. partmat must be a complete matrix,
without any missing or non-finite values. Columns of partmat are passed as the
first argument to part_FUN to obtain data, which is then passed to fitGLS.
Users are encouraged to use sample_partitions() to obtain a valid partmat.
The specific dimensions of partmat can have a substantial effect on the
efficiency of fitGLS_partition. For most systems, we do not recommend fitting
with partitions exceeding 3000 locations or pixels
(i.e., sample_partitions(partsize = 3000, ...)). Any larger, and the covariance
matrix inversions may become quite slow (or impossible for some machines).
It may help performance to use even smaller partitions of around 1000-2000
locations.
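As a rough illustration of the shape partmat is expected to have, the base-R sketch below builds a matrix whose columns are partitions and whose rows hold sampled location indices. It is only a stand-in for sample_partitions(), which remains the recommended route; the helper name make_partmat and its arguments are hypothetical and not part of remotePARTS.

```r
# Hypothetical helper (not part of remotePARTS): sample non-overlapping
# partitions of row indices and arrange them one partition per column.
make_partmat <- function(npix, npart, partsize) {
  stopifnot(npart * partsize <= npix)
  idx <- sample.int(npix, size = npart * partsize)  # no index is reused
  matrix(idx, nrow = partsize, ncol = npart)
}

set.seed(1)
pm_sketch <- make_partmat(npix = 10000, npart = 5, partsize = 1500)
dim(pm_sketch)              # rows = locations per partition, columns = partitions
all(is.finite(pm_sketch))   # partmat must be complete and finite
```

Keeping partsize in the 1000-3000 range keeps each partition's covariance matrix small enough to invert in reasonable time.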
ncross determines how many partitions are used to estimate cross-partition
statistics. All partitions, up to ncross, are compared with all others in a
pairwise fashion. There is no hard rule for setting ncross. More crosses will
ensure convergence, but we believe that the default of 6 (15 total pairwise
comparisons) should be sufficient for most moderate-sized maps if 1500-3000
pixel partitions are used. This may require testing with each individual
dataset to determine at what point convergence occurs.
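The number of comparisons implied by a given ncross can be checked with base R. This sketch assumes, as described above, that each of the first ncross partitions is paired once with each of the others; the object names are illustrative only.

```r
ncross <- 6
crosses <- t(combn(ncross, 2))  # one row per unordered pair of partitions
head(crosses)
nrow(crosses)                   # same value as choose(ncross, 2)
```

Because the count grows quadratically with ncross, small increases add many extra cross-partition fits.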
Covariance matrices for each partition are calculated with covar_FUN from
distances among points within the partition. Parameter values for covar_FUN
are given by covar.pars.
The distances among points are calculated with distm_FUN.
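To make the role of covar.pars concrete, here is a minimal sketch of an exponential covariance function applied to a small distance matrix. The functional form exp(-d / range) is an assumption about what an exponential covariance looks like, not a statement of how covar_exp is implemented; covar_exp_sketch is a hypothetical name.

```r
# Illustrative exponential covariance: correlation decays with distance at a
# rate set by `range`. The exact form used by covar_exp is an assumption here.
covar_exp_sketch <- function(d, range) exp(-d / range)

coords <- cbind(x = c(0, 0.1, 0.5), y = c(0, 0, 0))
D <- as.matrix(dist(coords))            # pairwise Euclidean distances
V <- covar_exp_sketch(D, range = 0.1)   # range plays the role of covar.pars
round(V, 3)
```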
distm_FUN can be any function, modeled after geosphere::distm(), that
satisfies both: 1) returns a distance matrix among points when a single
coordinate matrix is given as first argument; and 2) returns a matrix
containing distances between two coordinate matrices if given as the first and
second arguments.
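A user-supplied distance function following these two rules might look like the sketch below, which uses planar Euclidean distance. The name distm_euclid is hypothetical, and whether fitGLS_partition passes any further arguments to distm_FUN is not covered here.

```r
# Hypothetical planar distance function following the two rules above.
distm_euclid <- function(coords, coords2 = NULL) {
  a <- as.matrix(coords)
  b <- if (is.null(coords2)) a else as.matrix(coords2)
  # differences in x and y for every (row of a, row of b) pair
  dx <- outer(a[, 1], b[, 1], "-")
  dy <- outer(a[, 2], b[, 2], "-")
  sqrt(dx^2 + dy^2)
}

A <- cbind(c(0, 3), c(0, 4))
B <- cbind(c(0, 6, 3), c(0, 8, 4))
distm_euclid(A)      # 2 x 2 distances among the points of A
distm_euclid(A, B)   # 2 x 3 distances between A and B
```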
If nugget = NA, an ML nugget is obtained for each partition. Otherwise,
a fixed nugget is used for all partitions.
It is not required to use all partitions for cross-partition calculations, nor is it recommended to do so for most large data sets.
If progressbar = TRUE, a text progress bar shows the current status
of the calculations in the console.
Value
a "MC_partGLS", which is a precursor to a "partGLS" object
a "partGLS" object
"partGLS" object
fitGLS_partition returns a list object of class "partGLS" which
contains at least the following elements:
- call
the function call
- GLS
an optional list of "remoteGLS" objects, one for each partition
- part
statistics calculated from each partition: see below for further details
- cross
statistics calculated from each pair of crossed partitions, determined by ncross: see below for further details
- overall
summary statistics of the overall model: see below for further details
part is a sub-list containing the following elements
- coefficients
a numeric matrix of GLS coefficients for each partition
- SEs
a numeric matrix of coefficient standard errors
- tstats
a numeric matrix of coefficient t-statistics
- pvals_t
a numeric matrix of t-test p-values
- nuggets
a numeric vector of nuggets for each partition
- covar.pars
covar.pars input vector
- modstats
a numeric matrix with rows corresponding to partitions and columns corresponding to log-likelihoods (logLik), sum of squared errors (SSE), mean-squared error (MSE), regression mean-square (MSR), F-statistics (Fstat), and p-values from F-tests (pval_F)
cross is a sub-list containing the following elements, which are used
to calculate the combined (across partitions) standard errors of the coefficient
estimates and statistical tests. See Ives et al. (2022).
- rcoefs
a numeric matrix of cross-partition correlations in the estimates of the coefficients
- rSSRs
a numeric vector of cross-partition correlations in the regression sum of squares
- rSSEs
a numeric vector of cross-partition correlations in the sum of squared errors
and overall is a sub-list containing the elements
- coefficients
a numeric vector of the average coefficient estimates across all partitions
- rcoefficients
a numeric vector of the average cross-partition coefficient from across all crosses
- rSSR
the average cross-partition correlation in the regression sum of squares
- rSSE
the average cross-partition correlation in the sum of squared errors
- Fstat
the average F-statistic across partitions
- dfs
degrees of freedom to be used with partitioned GLS F-test
- partdims
dimensions of partmat
- pval.chisqr
if do.chisqr.test = TRUE, a p-value for the correlated chi-squared test
- t.test
if do.t.test = TRUE, a table with t-test results, including the coefficient estimates, standard errors, t-statistics, and p-values
part_data and part_csv both return a list with two elements:
- data
a dataframe, containing the data subset
- coords
a coordinate matrix for the subset
parallel implementation
In order to be efficient and account for different user situations, parallel
processing is available natively in fitGLS_partition. There are a few
different specifications that will result in different behavior:
When parallel = TRUE and ncores > 1, all calculations are done completely in
parallel (via multicore_fitGLS_partition()). In this case, parallelization is
implemented with the parallel, doParallel, and foreach packages. In this
version, all matrix operations are serialized on each worker but multiple
operations can occur simultaneously.
When parallel = FALSE and ncores > 1, most calculations are done on a single
core but matrix operations use multiple cores. In this case, ncores is passed
to fitGLS. With this option, it is suggested not to exceed the number of
physical cores (not threads).
When ncores <= 1, the calculations are completely serialized.
When ncores = NA (the default), only one core is used.
In the parallel implementation of this function, a progress bar is not possible,
so progressbar is ignored.
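The rules above can be restated as a small decision function. This is only a reading aid: describe_mode is a hypothetical helper that does not call remotePARTS, and it simply encodes the combinations of parallel and ncores as they are described in this section.

```r
# Hypothetical helper that restates the documented rules; it performs no fitting.
describe_mode <- function(parallel = TRUE, ncores = NA) {
  if (is.na(ncores) || ncores <= 1) {
    "serial: one core"
  } else if (parallel) {
    "partitions in parallel; matrix operations serial on each worker"
  } else {
    "partitions in series; matrix operations use multiple cores"
  }
}

describe_mode()                  # defaults: parallel = TRUE, ncores = NA
describe_mode(TRUE, ncores = 4)
describe_mode(FALSE, ncores = 4)
```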
part_FUN details
part_FUN can be any function that satisfies the following criteria:
1. the first argument of part_FUN must accept an index of pixels by which
to subset the data;
2. part_FUN must also accept formula and formula0 from fitGLS_partition; and
3. the output of part_FUN must be a list with at least the following elements,
which are passed to fitGLS:
- data
a data frame containing all variables given by formula. Rows should correspond to pixels specified by the first argument
- coords
a coordinate matrix or data frame. Rows should correspond to pixels specified by the first argument
Two functions that satisfy these criteria are provided by remotePARTS:
part_data and part_csv.
part_data uses an in-memory data frame (data) as a data source. part_csv
instead reads data from a csv file (file), one partition at a time, for
efficient memory usage. part_csv internally calls sqldf::read.csv.sql() for
fast and efficient row extraction.
Both functions use index to subset rows of data and formula and formula0
(optional) to determine which variables to select.
Both functions also use coord.names to indicate which variables contain
spatial coordinates. The name of the x-coordinate column should always precede
the y-coordinate column: c("x", "y").
Users are encouraged to write their own part_FUN functions to meet their
needs. For example, one might be interested in using data stored in a raster
stack or any other file type. In this case, a user-defined part_FUN function
allows access to fitGLS_partition without saving reformatted copies of data.
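A minimal sketch of a user-defined part_FUN that meets the three criteria listed in this section is shown below. The function name part_df_sketch and the toy data are hypothetical; a real implementation would swap the in-memory data frame for whatever storage backend is needed.

```r
# Hypothetical part_FUN: subsets an in-memory data frame and returns the
# list(data, coords) structure described in this section.
part_df_sketch <- function(index, formula, formula0 = NULL, data,
                           coord.names = c("lng", "lat")) {
  vars <- all.vars(formula)
  if (!is.null(formula0)) vars <- union(vars, all.vars(formula0))
  list(
    data   = data[index, vars, drop = FALSE],            # model variables only
    coords = as.matrix(data[index, coord.names, drop = FALSE])
  )
}

set.seed(42)
toy <- data.frame(y = rnorm(10), x = rnorm(10),
                  lng = runif(10), lat = runif(10))
p <- part_df_sketch(1:4, y ~ x, data = toy)
str(p)
```

Extra arguments such as data would be supplied through ... in the call to fitGLS_partition, in the same way as for part_data.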
References
Ives, A. R., L. Zhu, F. Wang, J. Zhu, C. J. Morrow, and V. C. Radeloff. in review. Statistical tests for non-independent partitions of large autocorrelated datasets. MethodsX.
See Also
Other partitionedGLS:
crosspart_GLS()
,
sample_partitions()
Examples
## read data
data(ndvi_AK10000)
df = ndvi_AK10000[seq_len(1000), ] # first 1000 rows
## create partition matrix
pm = sample_partitions(nrow(df), npart = 3)
## fit GLS with fixed nugget
partGLS = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
data = df, nugget = 0, do.t.test = TRUE)
## hypothesis tests
chisqr(partGLS) # explanatory power of model
t.test(partGLS) # significance of predictors
## now with a numeric predictor
fitGLS_partition(formula = CLS_coef ~ lat, partmat = pm, data = df, nugget = 0)
## fit ML nugget for each partition (slow)
(partGLS.opt = fitGLS_partition(formula = CLS_coef ~ 0 + land, partmat = pm,
data = df, nugget = NA))
partGLS.opt$part$nuggets # ML nuggets
# Certain model structures may not be useful:
## 0 intercept with numeric predictor (produces NAs) and gives a warning in statistical tests
fitGLS_partition(formula = CLS_coef ~ 0 + lat, partmat = pm, data = df, nugget = 0)
## intercept-only, gives warning
fitGLS_partition(formula = CLS_coef ~ 1, partmat = pm, data = df, nugget = 0,
do.chisqr.test = FALSE)
## part_data examples
part_data(1:20, CLS_coef ~ 0 + land, data = ndvi_AK10000)
## part_csv examples
## CAUTION: examples for part_csv() include manipulation side-effects:
# first, create a .csv file from ndvi_AK10000
data(ndvi_AK10000)
file.path = file.path(tempdir(), "ndviAK10000-remotePARTS.csv")
write.csv(ndvi_AK10000, file = file.path)
# build a partition from the first 20 pixels in the file
part_csv(1:20, formula = CLS_coef ~ 0 + land, file = file.path)
# now with a random 20 pixels
part_csv(sample(3000, 20), formula = CLS_coef ~ 0 + land, file = file.path)
# remove the example csv file from disk
file.remove(file.path)