mrbayes {mrregression}    R Documentation
Bayesian linear regression using Merge and Reduce
Description
mrbayes is used to conduct Bayesian linear regression on very large data sets
using Merge and Reduce as described in Geppert et al. (2020). Package rstan
needs to be installed. When calling the function, this is checked using
requireNamespace, as suggested by Hadley Wickham in "R packages" (section
Dependencies, http://r-pkgs.had.co.nz/description.html, accessed 2020-07-31).
Usage
mrbayes(
  y,
  intercept = TRUE,
  fileMr = NULL,
  dataMr = NULL,
  obsPerBlock,
  dataStan = NULL,
  sep = "auto",
  dec = ".",
  header = TRUE,
  naStrings = "NA",
  colNames = NULL,
  naAction = na.fail,
  ...
)
Arguments
y: Name of the dependent variable (character), e.g. y = 'y' as in the
example below.

intercept: Logical. Specifies whether the model contains an intercept
(intercept = TRUE) or not. Defaults to TRUE.

fileMr: (NULL) Character string giving the name of the file that contains
the data, to be read and evaluated block-wise. Either fileMr or dataMr has
to be specified.

dataMr: (NULL) A data.frame containing the data. Either fileMr or dataMr
has to be specified.

obsPerBlock: Number of observations per block of data used in Merge and
Reduce.

dataStan: (NULL) Optional list with the data to be passed to the Stan model
for each block. See Details for the default and for how to specify this
argument.
sep: Field separator used when the data are read from fileMr.

dec: Decimal separator used when the data are read from fileMr.

header: Logical; whether the file given in fileMr contains a header row
with the column names.

naStrings: Character vector of strings that are interpreted as NA when the
data are read from fileMr.

colNames: Optional character vector of column names to be used when the
data are read from fileMr.

naAction: Function specifying how missing values are handled. Defaults to
na.fail.
...: Further optional arguments to be passed on to the model fitting with
rstan.
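As a hedged sketch of the file-based interface (the file name "sim_data.csv"
and its format are assumptions made purely for illustration, not part of the
package), the reading-related arguments above could be used as follows:

## Not run: "sim_data.csv" is a hypothetical comma-separated file with a
## header row and a column named "y" holding the dependent variable.
# fit <- mrbayes(fileMr = "sim_data.csv", obsPerBlock = 500, y = "y",
#                sep = ",", dec = ".", header = TRUE, naStrings = "NA")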
Value
Returns an object of class "mrbayes", which is a list containing the
following components:

level: Level of the final model in Merge and Reduce.

numberObs: The total number of observations.

summaryStats: Summary statistics including the mean, median, quartiles, and
the 2.5% and 97.5% quantiles of the posterior distributions for each
regression coefficient and for the error term's standard deviation sigma.

diagnostics: Effective sample size (n_eff) and potential scale reduction
factor on split chains (Rhat), calculated from the output of the summary
method for stanfit objects. Note that, using Merge and Reduce, only one
value is reported for each regression coefficient: for n_eff the minimum
observed value on level 1, and for Rhat the maximum observed value on
level 1.

modelCode: The model code. Syntax as in the argument model_code of
rstan::stan.

dataHead: First six rows of the data in the first block. This serves as a
sanity check, especially when using the argument fileMr.
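As a short, hedged illustration (assuming an object fit1 returned by
mrbayes, e.g. as produced in the Examples section below), these components
are accessed as ordinary list elements:

fit1$level          # level of the final model in Merge and Reduce
fit1$numberObs      # total number of observations
fit1$modelCode      # the Stan model code that was used
fit1$dataHead       # first six rows of the first block, as a sanity check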
Details
The code of the default dataStan makes use of all predictors:

dataStan = list(n = nrow(currentBlock),
                d = (ncol(currentBlock) - 1),
                X = currentBlock[, -colNumY],
                y = currentBlock[, colNumY])

where currentBlock is the current block of data to be evaluated, n the
number of observations, d the number of variables (without intercept), X
contains the predictors, and y the dependent variable. colNumY is the
column number of the dependent variable, which the function finds
internally.
When specifying the argument dataStan, note two things:

1. Please use the syntax of the default dataStan, i.e. the object
containing the data of the block to be evaluated is called currentBlock,
the number of observations must be set to n = nrow(currentBlock), d needs
to be set to the number of variables without intercept, the dependent
variable must be named y, and the independent variables must be named X.

2. The expressions within the list must be unevaluated: therefore, use the
function quote, as sketched below.
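As a minimal, hedged sketch (restricting the model to the first two
predictor columns is an assumption chosen purely for illustration), a
user-specified dataStan could look like this:

## Sketch only: a custom dataStan using just the first two predictor columns.
## The expressions are wrapped in quote() so that they remain unevaluated and
## are evaluated on each block internally; colNumY is determined internally.
dataStan = list(n = quote(nrow(currentBlock)),
                d = 2,
                X = quote(currentBlock[, c(1, 2)]),
                y = quote(currentBlock[, colNumY]))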
References
Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020). Streaming
statistical models via Merge & Reduce. International Journal of Data
Science and Analytics, 1-17. doi:10.1007/s41060-020-00226-0
Examples
# Package rstan needs to be installed for running this example.
if (requireNamespace("rstan", quietly = TRUE)) {
  # Simulate data from a linear model with four predictors
  n = 2000
  p = 4
  set.seed(34)
  x1 = rnorm(n, 10, 2)
  x2 = rnorm(n, 5, 3)
  x3 = rnorm(n, -2, 1)
  x4 = rnorm(n, 0, 5)
  y = 2.4 - 0.6 * x1 + 5.5 * x2 - 7.2 * x3 + 5.7 * x4 + rnorm(n)
  data = data.frame(x1, x2, x3, x4, y)

  # Stan code for a normal linear model
  normalmodell = '
  data {
    int<lower=0> n;
    int<lower=0> d;
    matrix[n,d] X;        // predictor matrix
    vector[n] y;          // outcome vector
  }
  parameters {
    real alpha;           // intercept
    vector[d] beta;       // coefficients for predictors
    real<lower=0> sigma;  // error scale
  }
  model {
    y ~ normal(alpha + X * beta, sigma); // likelihood
  }
  '

  # Reference fit on the full data set with rstan
  datas = list(n = nrow(data), d = ncol(data) - 1,
               y = data[, dim(data)[2]], X = data[, 1:(dim(data)[2] - 1)])
  fit0 = rstan::stan(model_code = normalmodell, data = datas,
                     chains = 4, iter = 1000)

  # Fit with Merge and Reduce, using blocks of 500 observations
  fit1 = mrbayes(dataMr = data, obsPerBlock = 500, y = 'y')
}
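As a brief, hedged follow-up (assuming the example above has run
successfully), the Merge and Reduce results in fit1 can be inspected and
compared with the full-data reference fit fit0:

fit1$summaryStats         # Merge and Reduce posterior summaries
fit1$diagnostics          # minimum n_eff and maximum Rhat on level 1
summary(fit0)$summary     # posterior summary of the full-data fit, for comparison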