R: Estimating linear coefficients in partially linear...

mmdml {dmlalg}

R Documentation

Estimating linear coefficients in partially linear mixed-effects models with repeated measurements using double machine learning.

Description

Our goal is to perform inference for the linear parameter in partially linear mixed-effects models (PLMMs) with repeated measurements using double machine learning (DML).

The function mmdml estimates the linear parameter \beta_0 in the PLMM

Y_i = X_i\beta_0 + g(W_i) + Z_ib_i + \epsilon_{Y_i}, (i = 1, \ldots, N)

for the continuous response Y_i with linear covariates X_i, nonlinear covariates W_i, unobserved random effects b_i, and the error term \epsilon_{Y_i}. For each i, there are n_i repeated observations available. That is, Y_i is an n_i-dimensional vector. The matrix Z_i is fixed. The random effects b_i and the error terms \epsilon_{Y_i} are Gaussian distributed, independent, and independent of b_j and \epsilon_{Y_j}, respectively, for i\neq j. The linear and nonlineare covariates X_i and W_i are random and independent of all random effects and error terms.

The linear parameter \beta_0 can be estimated with a linear mixed-effects modeling approach with maximum likelihood after regressing out W_i nonparametrically from Y_i and X_i using machine learning algorithms. A linear mixed-effects model is estimated on the partialled-out data

Y_i - E[Y_i|W_i] = (X_i - E[X_i|W_i])\beta_0 + Z_ib_i + \epsilon_{Y_i}.

The conditional expectations are estimated with machine learning algorithms and sample splitting, and cross-fitting is used to regain full efficiency of the estimator of beta_0. This estimator is asymptotically Gaussian distributed and efficient.

Usage

mmdml(
  w, x, y, z, data = NULL,
  z_formula = NULL, group = "group",
  K = 2L, S = 100L,
  cond_method = rep("forest", 2),
  params = NULL,
  parallel = c("no", "multicore", "snow"),
  ncpus = 1L, cl = NULL,
  nr_random_eff = if (S > 5) 1L else S,
  nr_res = nr_random_eff
)

Arguments

`w`	A vector, matrix, or data frame. Its columns contain observations of the nonlinear predictors. Alternatively, if the data is provided in the data frame `data`, `w` is a character vector whose entries specify the columns of `data` acting as `W`.
`x`	A vector, matrix, or data frame. This is the linear predictor. Alternatively, if the data is provided in the data frame `data`, `x` is a character vector whose entries specify the columns of `data` acting as `X`.
`y`	A vector, matrix, or data frame. This is the response. Alternatively, if the data is provided in the data frame `data`, `y` is a character vector whose entries specify the columns of `data` acting as `Y`.
`z`	A vector, matrix, or data frame. It acts as the fixed matrix assigning the random effects. Alternatively, if the data is provided in the data frame `data`, `z` is a character vector whose entries specify the columns of `data` acting as `Z`.
`z_formula`	A string specifying the structure of the random effect using the notation as in `lmer` from package lme4, e.g., `(1\|id) + (1\|cask:id)`.
`group`	A string containing the name of the grouping variable in `zz`.
`data`	An optional data frame. If it is specified, its column names need to coincide with the character vectors specified in `a`, `w`, `x`, and `y`.
`K`	The number of sample splits used for cross-fitting.
`S`	Number of replications to correct for the random splitting of the sample. It is set to `100L` by default.
`cond_method`	A character vector of length 2 specifying the estimation methods used to fit the conditional expectations `E[X\|W]` and `E[Y\|W]`. Its components are from from `"spline"`, `"forest"`, `"ols"`, `"lasso"`, `"ridge"`, and `"elasticnet"`, or it is a list of length 2 with components from `"spline"`, `"forest"`, `"ols"`, `"lasso"`, `"ridge"`, and `"elasticnet"`, and where some components of the list are functions to estimate the conditional expectations. These functions have the input arguments `(yy_fit, ww_fit, ww_predict, params = NULL)` and output the conditional expectation of `E[Y\|W]` estimated with `yy_fit` and `ww_fit` and predicted with `ww_predict`. The argument `params` is described below. The functions return a matrix where the columns correspond to the component-wise estimated conditional expectations. Here, `yy` symbolically stands for either `x` or `y`. Please see below for the default arguments of the `"spline"`, `"forest"`, `"ols"`, `"lasso"`, `"ridge"`, and `"elasticnet"` methods.
`params`	An optional list of length 2. The 2 elements of this list are lists themselves. These lists specify additional input arguments for estimating the conditional expectations `E[X\|W]` and `E[Y\|W]`, respectively.
`parallel`	One out of `"no"`, `"multicore"`, or `"snow"` specifying the parallelization method used to compute the `S` replications. The default is `"no"`.
`ncpus`	An integer specifying the number of cores used if `parallel` is not set to `"no"`.
`cl`	An optional parallel or snow cluster if `parallel = "snow"`. The argument `ncpus` does not have to be specified if the argument `cl` is specified.
`nr_random_eff`	An integer specifying the number of unaggregated sets of random effect estimates among the `S` repetitions that should be returned.
`nr_res`	An integer specifying the number of unaggregated sets of residual estimates among the `S` repetitions that should be returned.

Details

The estimator of \beta_0 is computed using sample splitting and cross-fitting. The subject-specific data (over i = 1, \ldots, N) is split into K sets that are equally large if possible. For each such set, the nuisance parameters (that is, the conditional expectations E[Y_i|W_i] and E[X_i|W_i]) are estimated on its complement and evaluated on the set itself. Estimators of \beta_0 and the variance parameters are computed for each of the K data sets and are then averaged. If K = 1, no sample splitting is performed. In this case, the nuisance parameters are estimated and predicted on the full sample.

The whole estimation procedure is repeated S times to account for the randomness introduced by the random sample splits. The S repetitions can be run in parallel by specifying the arguments parallel and ncpus. The S estimators of \beta_0 and the variance components are aggregated by taking the median of them. The S variance-covariance matrices of the estimator of \beta_0 are aggregated by first adding a correction term to them that accounts for the random splitting and by afterwards taking the median of the corrected variance-covariance matrices. If d > 1, it can happen that this final matrix is not positive definite anymore, in which case the mean is considered instead. Estimates of the conditional random effects and the residuals are computed in each of the S repetitions. A total number of nr_random_eff and nr_res of them, respectively, is returned. Additionally, the random effects estimates from all S repetitions are aggregated using the mean and returned.

If the design in at least 0.5 * S of the S repetitions is singular, an error message is displayed. If the designs in some but less than 0.5 * S of the S repetitions are singular, another S repetitions are performed. If, in total, at least S repetitions result in a nonsingular design, the results are returned together with a warning message.

The default options of the "spline", "forest", "ols", "lasso", "ridge", and "elasticnet" methods are as follows. With the "spline" method, the function bs from the package splines is employed with degree = 3 and df = ceiling(N ^ (1 / 5)) + 2 if N satisfies (df + 1) * v + 1 > N, where v denotes the number of columns of w and N denotes the sample size. Otherwise, df is consecutively reduced by 1 until this condition is satisfied. The splines are fitted and predicted on different data sets. If they are extrapolated, a warning message is displayed. With the "forest" method, the function randomForest from the package randomForest is employed with nodesize = 5, ntree = 500, na.action = na.omit, and replace = TRUE. With the "ols" method, the default arguments are used and no additional arguments are specified. With the "lasso" and "ridge" methods, the function cv.glmnet from the package glmnet performs 10-fold cross validation by default (argument nfolds) to find the one-standard-error-rule \lambda-parameter. With the "elasticnet" method, the function cv.glmnet from the package glmnet performs 10-fold cross validation (argument nfolds) with alpha = 0.5 by default to find the one-standard-error-rule \lambda-parameter. All default values of the mentioned parameters can be adapted by specifying the argument params.

There are three possibilities to set the argument parallel, namely "no" for serial evaluation (default), "multicore" for parallel evaluation using forking, and "snow" for parallel evaluation using a parallel socket cluster. It is recommended to select RNGkind ("L'Ecuyer-CMRG") and to set a seed to ensure that the parallel computing of the package dmlalg is reproducible. This ensures that each processor receives a different substream of the pseudo random number generator stream. Thus, the results reproducible if the arguments remain unchanged. There is an optional argument cl to specify a custom cluster if parallel = "snow".

The response y needs to be continuous. The covariate w may contain factor variables in its columns. If the variable x contains factor variables, the factors should not be included as factor columns of x. Instead, dummy encoding should be used for all individual levels of the factor. That is, a factor with 4 levels should be encoded with 4 columns where each column consists of 1 and 0 entries indicating the presence of the respective level of the factor.

There are confint, fixef, print, ranef, residuals, sigma, summary, vcov, and VarCorr methods available for objects fitted with mmdml. They are called confint.mmdml, fixef.mmdml, print.mmdml, ranef.mmdml, residuals.mmdml, sigma.mmdml, summary.mmdml, vcov.mmdml, and VarCorr.mmdml, respectively.

Value

A list similar to the output of lmer from package lme4 containing the following entries.

`beta`	Estimator of the linear coefficient `\beta_0`.
`vcov`	Variance-covariance matrix of `beta`. Also see `lmer`. The `S` individual variance-covariance matrices are aggregated by first adding a correction term to them correcting for the randomness of the sample splits and by subsequently taking the median of the corrected variance-covariance matrices.
`sigma`	Please see `lmer` for its meaning. It is computed by averaging over the `K` sample splits and by aggregating the `S` repetitions using the median.
`theta`	Please see `lmer` for its meaning. It is computed by averaging over the `K` sample splits and by aggregating the `S` repetitions using the median.
`varcor`	Variance correlation components computed with `theta`. Please also see `lmer`.
`random_eff`	Conditional estimates of the random effects similarly to `lmer`. The individual sets of `S` random effects estimates are aggregated using the mean.
`random_eff_all`	The first `nr_random_eff` sets of the `S` sets of random effects estimates.
`residuals`	The first `nr_res` sets of the `S` sets of residuals. Each set of residuals is computed with parameters and data that is aggregated over the `K` sample splits.

The other elements ngrps, nobs, fitMsgs, cnms, nc, nms, useSc, optinfo, and methTitle are as in lmer. The gradient and Hessian information of optinfo is computed by aggregating the respective information over the S repetitions with the median.

References

C. Emmenegger and P. Bühlmann. Double Machine Learning for Partially Linear Mixed-Effects Models with Repeated Measurements. Preprint arXiv:2108.13657.

Examples

## generate data
RNGkind("L'Ecuyer-CMRG")
set.seed(19)
data1 <- example_data_mmdml(beta0 = 0.2)
data2 <- example_data_mmdml(beta0 = c(0.2, 0.2))

## fit models
## Caveat: Warning messages are displayed because the small number of
## observations results in a singular random effects model
fit1 <-
  mmdml(w = c("w1", "w2", "w3"), x = "x1", y = "resp", z = c("id", "cask"),
        data = data1, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)

fit2 <-
  mmdml(w = c("w1", "w2", "w3"), x = c("x1", "x2"), y = "resp", z = c("id", "cask"),
        data = data2, z_formula = "(1|id) + (1|cask:id)", group = "id", S = 3)

## apply methods
confint(fit2)
fixef(fit2)
print(fit2)
ranef(fit2)
residuals(fit2)
sigma(fit2)
summary(fit2)
vcov(fit2)
VarCorr(fit2)

[Package dmlalg version 1.0.2 Index]