mrfrequentist {mrregression} | R Documentation |
Fitting frequentist linear models using Merge and Reduce
Description
mrfrequentist is used to conduct frequentist linear
regression on very large data sets using Merge and Reduce as
described in Geppert et al. (2020).
Usage
mrfrequentist(
  formula,
  fileMr = NULL,
  dataMr = NULL,
  obsPerBlock,
  approach = c("1", "3"),
  sep = "auto",
  dec = ".",
  header = TRUE,
  naStrings = "NA",
  colNames = NULL,
  naAction = na.fail
)
Arguments
formula |
(formula)
See formula. Note that mrfrequentist currently
supports numeric predictors only.
|
fileMr |
(character) The name of a file, including the
file path, to be read in blockwise. Either fileMr or dataMr
needs to be specified. When using this argument, the arguments sep,
dec, header, naStrings, and colNames (as in fread)
are relevant. Further options from fread are currently not supported.
Also note that defaults might differ from those of fread. If the data to be
read in has row names, these will be read in as a regular column. This may need
special treatment.
|
dataMr |
(data.frame) The data to be used for the regression
analysis. Either fileMr or dataMr needs to be specified.
Note that the arguments sep, dec, header, naStrings,
and colNames are ignored when dataMr is specified.
|
obsPerBlock |
(numeric) Value specifying the number of
observations in each block. This number has to be larger than the number of
regression coefficients. Moreover, for approach "1" the recommended ratio of
observations per regression coefficient is larger than 25 (Geppert et al., 2020).
Note that the last block may contain fewer observations than specified,
depending on the sample size. If the number of observations in this last
block is too small, it is not included in the model and a warning is
issued.
|
approach |
(character) Approach specifying the merge
technique. One of either "1" or "3". Approach "1" is based on a weighted
mean procedure whereas approach "3" is an exact method based on blockwise
calculations of X'X, y'X and y'y. See Geppert et al. (2020) for details on
the approaches and section Details below for comments on approach "3".
|
sep |
See the documentation of fread. Default is
"auto". Ignored when dataMr is specified.
|
dec |
See the documentation of fread. Default is
".". Ignored when dataMr is specified.
|
header |
(logical) See the documentation of
fread. Defaults to TRUE. Ignored when
dataMr is specified. If header is set to FALSE and no
colNames are given, then column names default to "V" followed by the
column number.
|
naStrings |
(character) Optional argument.
See argument na.strings of fread.
Default is "NA". Ignored when dataMr is specified and optional
when fileMr is used.
|
colNames |
(character vector) Same as argument
col.names of fread. Ignored when dataMr is
specified and optional when fileMr is used.
|
naAction |
(function) Action to be taken when missing values
are present in the data. Currently only na.fail is
supported.
|
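The block arithmetic described for obsPerBlock can be illustrated in base R. This is only a sketch under assumed values: n, p, and all variable names are hypothetical, and the rule used here for a "too small" last block (no more observations than regression coefficients) is an assumption derived from the requirement stated above, not the package's internal code.

```r
# Hypothetical sizes chosen for illustration only
n <- 1050L            # total number of observations
p <- 11L              # number of regression coefficients (incl. intercept)
obsPerBlock <- 300L   # requested block size

nFullBlocks <- n %/% obsPerBlock   # number of complete blocks
lastBlock   <- n %% obsPerBlock    # observations left over for the last block
ratio       <- obsPerBlock / p     # observations per regression coefficient

# Documented requirement: each block needs more observations than coefficients
stopifnot(obsPerBlock > p)

# Documented recommendation for approach "1": ratio larger than 25
meetsRecommendation <- ratio > 25

# Assumption: the last block counts as too small when it does not contain
# more observations than there are regression coefficients
lastBlockUsable <- lastBlock > p
```

With these values there are three full blocks and a shorter last block of 150 observations, which would still satisfy the assumed size rule.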
Value
Returns an object of class "mrfrequentist", which is a list
containing the following components for both approaches "1" and "3":
approach |
The approach used for merging the models. Either "1" or "3".
|
formula |
The model's formula.
|
level |
Number of levels of the final model in Merge and Reduce. This is equal
to \lceil \log_2(\text{numberObs}/\text{obsPerBlock}) \rceil + 1
and corresponds to the number of buckets in Figure 1 of Geppert et al. (2020).
|
numberObs |
The total number of observations.
|
summaryStats |
Summary statistics reporting the estimated regression coefficients
and their unbiased standard errors. Estimates are based
on the merge technique specified in the argument approach.
For approach "1", the estimates of the standard errors are corrected
by dividing by \sqrt{\lceil \text{numberObs}/\text{obsPerBlock} \rceil}.
For further details see Geppert et al. (2020).
For approach "3", the unbiased estimates of the standard errors are given.
|
dataHead |
First six rows of the data in the first block. This serves
as a sanity check, especially when using the argument fileMr.
|
terms |
Terms object.
|
Additionally for approach "3" only:
XTX |
The final model's crossprod(X, X).
|
yTX |
The final model's crossprod(y, X).
|
yTy |
The final model's crossprod(y, y).
|
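The formulas given for level and for the standard error correction in approach "1" can be evaluated directly in base R. The values of numberObs and obsPerBlock below are hypothetical and only serve to show the arithmetic; the variable names level and seCorrection are chosen for this sketch.

```r
# Hypothetical sizes for illustration only
numberObs   <- 3000
obsPerBlock <- 300

# Number of levels: ceiling(log2(numberObs / obsPerBlock)) + 1
level <- ceiling(log2(numberObs / obsPerBlock)) + 1

# Divisor applied to the standard errors in approach "1":
# sqrt(ceiling(numberObs / obsPerBlock))
seCorrection <- sqrt(ceiling(numberObs / obsPerBlock))
```

Here the data split into ten blocks, so the sketch yields five levels and a divisor of sqrt(10).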
Details
In approach "3" the estimated regression coefficients and their unbiased standard errors
are calculated via QR decompositions on X'X (as in speedlm
with argument method = "qr"). Moreover, the merge step uses the same
idea of blockwise addition for X'X, y'y and y'X as speedglm's updating
procedure updateWithMoreData. Conceptually though,
Merge and Reduce is not an updating algorithm as it merges models based on
a comparable amount of data along a tree structure to obtain a final model.
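The idea of blockwise addition of X'X, y'X and y'y can be sketched in base R on simulated data. This is not the package's implementation: the data, the block size of 250, and all variable names are assumptions made for illustration, and the blocks are added sequentially rather than along the Merge and Reduce tree. Since the sums are the same either way, the result matches an ordinary lm() fit on the full data.

```r
set.seed(1)
n <- 1000
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # simulated design matrix with intercept
y <- drop(X %*% c(2, -1, 0.5, 3) + rnorm(n))   # simulated response

# Split row indices into blocks of 250 observations (hypothetical block size)
blocks <- split(seq_len(n), ceiling(seq_len(n) / 250))

# Accumulate the three cross products block by block
XTX <- matrix(0, ncol(X), ncol(X))
yTX <- matrix(0, 1, ncol(X))
yTy <- 0
for (idx in blocks) {
  Xb  <- X[idx, , drop = FALSE]
  yb  <- y[idx]
  XTX <- XTX + crossprod(Xb)       # X'X of this block
  yTX <- yTX + crossprod(yb, Xb)   # y'X of this block
  yTy <- yTy + sum(yb^2)           # y'y of this block
}

# Solve the normal equations via a QR decomposition of X'X
betaHat <- qr.solve(XTX, t(yTX))

# Residual sum of squares from the accumulated quantities: y'y - y'X beta
rss    <- yTy - drop(yTX %*% betaHat)
sigma2 <- rss / (n - ncol(X))            # unbiased variance estimate
se     <- sqrt(sigma2 * diag(solve(XTX)))

# Reference fit on the full data for comparison
ref <- lm(y ~ X - 1)
```

Comparing drop(betaHat) and se with coef(ref) and the standard errors in summary(ref) shows agreement up to numerical tolerance, which is what makes this kind of merge exact.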
References
Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020).
Streaming statistical models via Merge & Reduce. International Journal
of Data Science and Analytics, 1-17,
doi: https://doi.org/10.1007/s41060-020-00226-0
Examples
## run mrfrequentist() with dataMr
data(exampleData)
fit1 = mrfrequentist(dataMr = exampleData, approach = "1", obsPerBlock = 300,
                     formula = V11 ~ .)
## run mrfrequentist() with fileMr
filepath = system.file("extdata", "exampleFile.txt", package = "mrregression")
fit2 = mrfrequentist(fileMr = filepath, approach = "3", header = TRUE,
                     obsPerBlock = 100, formula = y ~ .)
[Package mrregression version 1.0.0]