mmdml {dmlalg} | R Documentation |
Estimating linear coefficients in partially linear mixed-effects models with repeated measurements using double machine learning.
Description
Our goal is to perform inference for the linear parameter in partially linear mixed-effects models (PLMMs) with repeated measurements using double machine learning (DML).
The function mmdml
estimates the linear parameter \beta_0
in the PLMM
Y_i = X_i\beta_0 + g(W_i) + Z_ib_i + \epsilon_{Y_i},
(i = 1, \ldots, N)
for the continuous response Y_i
with linear covariates
X_i
, nonlinear covariates W_i
, unobserved random effects
b_i
, and the error term \epsilon_{Y_i}
.
For each i
, there are n_i
repeated observations available.
That is, Y_i
is an n_i
-dimensional vector.
The matrix Z_i
is fixed. The random effects b_i
and the
error terms \epsilon_{Y_i}
are Gaussian distributed, independent,
and independent of b_j
and \epsilon_{Y_j}
, respectively,
for i\neq j
. The linear and nonlineare covariates X_i
and
W_i
are random and independent of all random effects and error terms.
The linear parameter \beta_0
can be estimated with
a linear mixed-effects modeling approach with maximum likelihood
after regressing out W_i
nonparametrically from Y_i
and X_i
using machine learning
algorithms.
A linear mixed-effects model is estimated on the partialled-out data
Y_i - E[Y_i|W_i] = (X_i - E[X_i|W_i])\beta_0 + Z_ib_i
+ \epsilon_{Y_i}.
The conditional expectations are estimated with machine learning algorithms
and sample splitting, and cross-fitting is used to regain full efficiency
of the estimator of beta_0
. This estimator
is asymptotically Gaussian distributed and efficient.
Usage
mmdml(
w, x, y, z, data = NULL,
z_formula = NULL, group = "group",
K = 2L, S = 100L,
cond_method = rep("forest", 2),
params = NULL,
parallel = c("no", "multicore", "snow"),
ncpus = 1L, cl = NULL,
nr_random_eff = if (S > 5) 1L else S,
nr_res = nr_random_eff
)
Arguments
w |
A vector, matrix, or data frame. Its columns contain observations
of the nonlinear predictors. Alternatively, if the data is
provided in the data frame |
x |
A vector, matrix, or data frame. This is the linear predictor.
Alternatively, if the data is provided in the data frame
|
y |
A vector, matrix, or data frame. This is the response.
Alternatively, if the data is provided in the data frame
|
z |
A vector, matrix, or data frame. It acts as the fixed matrix
assigning the random effects.
Alternatively, if the data is
provided in the data frame |
z_formula |
A string specifying the structure of the random effect
using the notation as in |
group |
A string containing the name of the grouping variable in
|
data |
An optional data frame. If it is specified, its column names
need to coincide with the character vectors specified in |
K |
The number of sample splits used for cross-fitting. |
S |
Number of replications to correct for the random
splitting of the sample. It is set to |
cond_method |
A character vector of length 2 specifying the estimation
methods used to fit the conditional
expectations |
params |
An optional list of length 2. The 2 elements of this list
are lists themselves. These lists specify additional input arguments for
estimating the conditional expectations |
parallel |
One out of |
ncpus |
An integer specifying the number of cores used if
|
cl |
An optional parallel or snow cluster if |
nr_random_eff |
An integer specifying the number of unaggregated sets
of random effect estimates among the |
nr_res |
An integer specifying the number of unaggregated sets
of residual estimates among the |
Details
The estimator of \beta_0
is computed using sample splitting and
cross-fitting.
The subject-specific data (over i = 1, \ldots, N
) is split
into K
sets that are equally large
if possible. For each such set, the nuisance parameters
(that is, the conditional expectations E[Y_i|W_i]
and
E[X_i|W_i]
) are estimated on its complement and evaluated on the
set itself.
Estimators of \beta_0
and the variance parameters are computed
for each
of the K
data sets and are then averaged.
If K = 1
, no sample splitting
is performed. In this case, the nuisance parameters are estimated and
predicted on the full sample.
The whole estimation procedure is repeated S
times to
account for the randomness introduced by the random sample splits.
The S
repetitions can be run in parallel by specifying the
arguments parallel
and ncpus
.
The S
estimators of \beta_0
and the variance components
are aggregated by taking the
median of them. The S
variance-covariance matrices of the estimator
of \beta_0
are aggregated
by first adding a correction term to them that accounts for the random
splitting and by afterwards taking the median of the corrected
variance-covariance matrices. If d > 1
, it can happen that this
final matrix is not positive definite anymore, in which case the mean
is considered instead.
Estimates of the conditional random effects and the residuals are computed
in each of the S
repetitions. A total number of nr_random_eff
and nr_res
of them, respectively, is returned.
Additionally, the random effects estimates from all S
repetitions
are aggregated using the mean and returned.
If the design in at least 0.5 * S
of the S
repetitions is
singular, an error message is displayed.
If the designs in some but less than 0.5 * S
of the S
repetitions are singular, another S
repetitions are performed.
If, in total, at least S
repetitions result in a nonsingular design,
the results are returned together with a warning message.
The default options of the "spline"
, "forest"
,
"ols"
, "lasso"
, "ridge"
, and "elasticnet"
methods are as follows. With the "spline"
method,
the function bs
from the package splines
is employed
with degree = 3
and df = ceiling(N ^ (1 / 5)) + 2
if
N
satisfies (df + 1) * v + 1 > N
, where v
denotes
the number of columns of w
and N
denotes the sample size.
Otherwise, df
is consecutively
reduced by 1
until this condition is satisfied.
The splines are fitted and predicted on different data sets.
If they are extrapolated, a warning message is displayed.
With the "forest"
method, the function randomForest
from
the package randomForest
is employed with nodesize = 5
,
ntree = 500
, na.action = na.omit
, and replace = TRUE
.
With the "ols"
method, the default arguments are used and no
additional arguments are specified.
With the "lasso"
and "ridge"
methods,
the function cv.glmnet
from the package glmnet
performs
10-fold cross validation by default (argument nfolds
)
to find the one-standard-error-rule \lambda
-parameter.
With the "elasticnet"
method, the function cv.glmnet
from
the package glmnet
performs 10-fold cross validation
(argument nfolds
) with alpha = 0.5
by default
to find the one-standard-error-rule \lambda
-parameter.
All default values of the mentioned parameters can be adapted by
specifying the argument params
.
There are three possibilities to set the argument parallel
, namely
"no"
for serial evaluation (default),
"multicore"
for parallel evaluation using forking,
and "snow"
for parallel evaluation using a parallel
socket cluster. It is recommended to select RNGkind
("L'Ecuyer-CMRG"
) and to set a seed to ensure that the parallel
computing of the package dmlalg
is reproducible.
This ensures that each processor receives a different substream of the
pseudo random number generator stream.
Thus, the results reproducible if the arguments remain unchanged.
There is an optional argument cl
to specify a custom cluster
if parallel = "snow"
.
The response y
needs to be continuous.
The covariate w
may contain factor variables in its columns.
If the variable x
contains factor variables,
the factors should not be included as factor columns of x
.
Instead, dummy encoding should be used for all individual levels of the
factor.
That is, a factor with 4 levels should be encoded with 4 columns where each
column consists of 1 and 0 entries indicating the presence of the
respective level of the factor.
There are confint
, fixef
, print
, ranef
,
residuals
, sigma
, summary
, vcov
,
and VarCorr
methods available
for objects fitted with mmdml
. They are called
confint.mmdml
,
fixef.mmdml
,
print.mmdml
,
ranef.mmdml
,
residuals.mmdml
,
sigma.mmdml
,
summary.mmdml
,
vcov.mmdml
, and
VarCorr.mmdml
, respectively.
Value
A list similar to the output of lmer
from package lme4 containing
the following entries.
beta |
Estimator of the linear coefficient |
vcov |
Variance-covariance matrix of |
sigma |
Please see |
theta |
Please see |
varcor |
Variance correlation components computed with
|
random_eff |
Conditional estimates of the random effects
similarly to |
random_eff_all |
The first |
residuals |
The first |
The other elements ngrps
, nobs
, fitMsgs
, cnms
,
nc
, nms
, useSc
, optinfo
, and methTitle
are as in lmer
.
The gradient and Hessian information of optinfo
is computed
by aggregating the respective information over the S
repetitions
with the median.
References
C. Emmenegger and P. Bühlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.
See Also
confint
,
fixef
,
print
,
ranef
,
residuals
,
sigma
,
summary
,
vcov
,
VarCorr
Examples
## generate data
RNGkind("L'Ecuyer-CMRG")
set.seed(19)
data1 <- example_data_mmdml(beta0 = 0.2)
data2 <- example_data_mmdml(beta0 = c(0.2, 0.2))
## fit models
## Caveat: Warning messages are displayed because the small number of
## observations results in a singular random effects model
fit1 <-
mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"),
data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)
fit2 <-
mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"),
data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)
## apply methods
confint(fit2)
fixef(fit2)
print(fit2)
ranef(fit2)
residuals(fit2)
sigma(fit2)
summary(fit2)
vcov(fit2)
VarCorr(fit2)