augSIMEX {augSIMEX}    R Documentation

Analysis of Data with Mixed Measurement Error and Misclassification in Covariates


Description

Implementation of the SIMEX algorithm for data with mixed measurement error and misclassification in covariates.


Usage

augSIMEX(mainformula = formula(data), mismodel = pi | qi ~ 1,
         meformula = NULL, family = gaussian, data,
         validationdata, err.var, mis.var, err.true, mis.true,
         err.mat = NULL, cppmethod = TRUE, repeated = FALSE,
         repind = list(), subset, offset, weights, na.action,
         scorefunction = NULL, lambda = NULL, M = 5, B = 20,
         nBoot = 50, extrapolation = c("quadratic", "linear"), bound = 8,
         initial = NULL, ...)



Arguments

mainformula
an object of class “formula”: a symbolic description of the main model, comprising the response variable, the error-prone covariates, and the other covariates.


mismodel
an object of class “Formula”: a symbolic description of the model for the misclassification rates. See Details for the specification of the model.


meformula
an object of class “Formula” specifying the measurement error model for each error-prone covariate. The number of responses should equal the number of error-prone covariates. The default is the classical additive model. See Details for the specification of the model.


family
an object of class “family” (as in the glm function of the stats package).


data
a data frame or matrix containing the main data. It includes the response, the observed covariates that are subject to measurement error, the observed binary covariate that is prone to misclassification, and, optionally, the precisely measured covariates. By default, all covariates mentioned in mainformula are included.


validationdata
a data frame or matrix containing the validation data. It includes the observed covariates that are subject to measurement error together with their precisely measured counterparts, the observed binary covariate that is prone to misclassification together with its precisely measured counterpart, and the precisely measured covariates.


err.var
a character vector specifying the names of the covariates that are subject to measurement error.


mis.var
a string specifying the name of the binary variable that is subject to misclassification.


err.true
a character vector specifying the names of the precisely measured counterparts of the covariates in err.var.


mis.true
a string specifying the name of the precisely measured binary variable.


err.mat
the variance-covariance matrix of the Normal distribution from which errors are generated in the simulation step.


cppmethod
a logical value indicating whether to solve the score function via C++ functions, which relies on the Rcpp package. The C++-based method is much faster and more computationally efficient than the R-based method. See the discussion section.


repeated
a logical value indicating whether repeated measurements are involved.


repind
a list of vectors naming the repeated measurements of each error-prone covariate. See Details.


subset
a logical vector indicating which subjects should be included in the fitting.


offset
a numeric vector specifying the offset. The default is NULL.


weights
an optional numeric vector of weights to be used in the fitting.


na.action
a function indicating how missing data should be handled.


scorefunction
a user-specified score function. For generality, users can supply their own score function. It should be a function of the parameters, response, covariates, weights, and offset, and it should return a matrix of score values with one entry for each individual and each parameter. An example is shown in glmscore.


lambda
a sequence of nonnegative lambda values, which the user can specify directly when M is not specified. The first element is usually set to 0.


M
the number of values in the predetermined lambda vector, used if lambda is missing. The default is 5.


B
the number of data sets generated in the simulation step. The default is 20.


nBoot
the number of bootstrap iterations. The default is 50.


extrapolation
the regression model used in the extrapolation step. The options are “linear” and “quadratic”. The default is “quadratic”.


bound
a value or vector specifying the bound on the absolute values of the regression coefficients. During the simulation, estimates outside the bound are filtered out before the extrapolation step.


initial
the initial values of the parameters.


...
other arguments passed to the function.
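The shape expected of a user-supplied score function (see scorefunction above) can be illustrated for a logistic model. This is only a sketch: the name logit_score and the argument names and order are hypothetical and are not taken from the package's glmscore.

```r
# Hypothetical user-defined score function for a logistic model.
# Returns an n x p matrix: one row per individual, one column per parameter.
logit_score <- function(beta, Y, X, weights, offset) {
  eta <- as.vector(X %*% beta) + offset   # linear predictor
  mu  <- plogis(eta)                      # inverse logit link
  (weights * (Y - mu)) * X                # row i: w_i * (y_i - mu_i) * x_i
}

set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))
Y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))
fit <- glm(Y ~ X - 1, family = binomial)

S <- logit_score(coef(fit), Y, X, weights = rep(1, n), offset = rep(0, n))
dim(S)       # n rows, 2 columns
colSums(S)   # close to zero at the maximum likelihood estimate
```

Summing the rows gives the estimating equation; returning the per-individual contributions rather than the sum is what the argument description asks for.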


Details

The misclassification model is specified in "Formula" format, where the misclassification rates for both classes of the binary variable are set simultaneously. The left-hand side sets the responses of the misclassification model; it should always be pi (indicating pimodel) and qi (indicating qimodel), separated by "|". The right-hand side sets the covariates of the misclassification model. If the covariates of the two models differ, use "|" to separate them, in the same order as the responses. See the examples.
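To make the idea of two covariate-dependent misclassification rates concrete, the following sketch fits one logistic regression per true class on simulated validation data. It uses only glm from stats and is not the package's internal code. The data, the names p01 and p10, and the labelling of the two rates are assumptions; the source does not state which rate corresponds to pi and which to qi.

```r
set.seed(9)
m <- 2000
val <- data.frame(W = rnorm(m), Z = rbinom(m, 1, 0.5))

# Assumed labelling: p01 = P(Zstar = 1 | Z = 0), p10 = P(Zstar = 0 | Z = 1)
p01 <- plogis(-2 + 0.5 * val$W)
p10 <- plogis(-1.5 - 0.3 * val$W)
flip <- rbinom(m, 1, ifelse(val$Z == 0, p01, p10))
val$Zstar <- ifelse(flip == 1, 1 - val$Z, val$Z)

# One logistic regression per true class, each with covariate W
fit01 <- glm(Zstar ~ W, family = binomial, data = subset(val, Z == 0))
fit10 <- glm(I(1 - Zstar) ~ W, family = binomial, data = subset(val, Z == 1))
rbind(p01 = coef(fit01), p10 = coef(fit10))
```

Using the same right-hand side for both fits mirrors a specification such as pi | qi ~ W; different right-hand sides would correspond to separating the covariates with "|".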

The measurement error model is also specified in "Formula" format. The left-hand side sets the responses of the measurement error model, which should be consistent with the specification of err.var. The responses are separated by "|". The right-hand side sets the covariates. If the covariates differ between responses, use "|" to separate the covariate specifications, in the same order as the responses. See the examples.
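Under the classical additive model, the error covariance can be read off validation data as the covariance of the differences between observed and true covariates. The sketch below illustrates this with simulated data; the data frame val and its column names are hypothetical, and whether the package computes the matrix this way is an assumption.

```r
set.seed(7)
m <- 500
# Hypothetical validation data with two error-prone covariates
X1 <- rnorm(m)
X2 <- rnorm(m)
val <- data.frame(X1 = X1, X2 = X2,
                  Xstar1 = X1 + rnorm(m, sd = 0.5),
                  Xstar2 = X2 + rnorm(m, sd = 0.3))

# Classical additive model: Xstar = X + e, so e = Xstar - X
resid_mat <- as.matrix(val[, c("Xstar1", "Xstar2")]) -
             as.matrix(val[, c("X1", "X2")])
Sigma_e <- cov(resid_mat)   # estimated error variance-covariance matrix
Sigma_e
```

A matrix of this kind has the form that the err.mat argument describes: the variance-covariance matrix of the Normal errors generated in the simulation step.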

In the case of repeated measurements, the user can include any one of the measurements in the formula of the main model. If more than one covariate has multiple replicates, the user should name each vector of repeated measurements in the repind list by the corresponding representative measurement in the formula of the main model. See the examples.
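The naming convention for repind, and why replicates carry information about the error variance, can be sketched as follows. The data are simulated, and the within-subject variance estimator is a standard device shown for illustration; it is an assumption that the package uses something similar.

```r
set.seed(3)
n <- 1000
X <- rnorm(n)
main <- data.frame(Xstar1 = X + rnorm(n, sd = 0.6),
                   Xstar2 = X + rnorm(n, sd = 0.6))

# List named by the representative measurement used in the main formula
repind <- list(Xstar1 = c("Xstar1", "Xstar2"))

reps <- as.matrix(main[, repind$Xstar1])
# Average within-subject variance across replicates estimates the error variance
sigma2_hat <- mean(apply(reps, 1, var))
sigma2_hat
```

Here the list element is named Xstar1 because Xstar1 is the measurement that would appear in the main formula, while the vector lists every replicate column, including Xstar1 itself.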

The number of bootstrap repetitions should be at least 2, i.e., nBoot >= 2. Otherwise, an error might occur.
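The reason a single bootstrap repetition is not enough can be seen with a plain nonparametric bootstrap of a logistic regression. This sketch uses only stats functions and a hypothetical helper boot_coef; it does not reproduce the package's bootstrap.

```r
set.seed(11)
n <- 300
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(0.5 + d$x))

# Hypothetical helper: one row of coefficients per bootstrap replicate
boot_coef <- function(nBoot) {
  t(replicate(nBoot, {
    idx <- sample(n, replace = TRUE)   # resample rows with replacement
    coef(glm(y ~ x, family = binomial, data = d[idx, ]))
  }))
}

vcov_1  <- cov(boot_coef(1))    # single replicate: covariance is NA
vcov_50 <- cov(boot_coef(50))   # usable covariance estimate
```

A sample covariance needs at least two rows, so the matrix from one replicate consists entirely of NA values.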

The examples are mainly for illustration. NAs may be generated because the bootstrap parameter nBoot is set to only 2. To obtain precise results, the simulation parameters should be set to larger values, which also requires more computing time.
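The roles of lambda, B, and the extrapolation model can be seen in a bare-bones version of generic SIMEX for a single error-prone covariate in a linear model. This is a simplified sketch, not the augmented algorithm implemented by augSIMEX: the error standard deviation is assumed known, there is no misclassified covariate, and the lambda grid is an assumed choice.

```r
set.seed(42)
n <- 2000
X     <- rnorm(n)                    # true covariate (unobserved in practice)
sigma <- 0.7                         # assumed known measurement error s.d.
Xstar <- X + rnorm(n, sd = sigma)    # observed, error-prone covariate
Y     <- 1 + 2 * X + rnorm(n)

lambda <- seq(0, 2, length.out = 5)  # assumed grid with M = 5 values
B <- 20                              # simulated data sets per lambda

# Simulation step: inflate the error variance to (1 + lambda) * sigma^2
slope_at <- function(lam) {
  mean(replicate(B, {
    Xsim <- Xstar + sqrt(lam) * rnorm(n, sd = sigma)
    coef(lm(Y ~ Xsim))[2]
  }))
}
slopes <- sapply(lambda, slope_at)

# Extrapolation step: quadratic in lambda, evaluated at lambda = -1
extrap <- lm(slopes ~ lambda + I(lambda^2))
simex_slope <- unname(predict(extrap, newdata = data.frame(lambda = -1)))
naive_slope <- slopes[1]
c(naive = naive_slope, simex = simex_slope)
```

The naive slope is attenuated toward zero, and extrapolating back to lambda = -1 moves the estimate toward the true value of 2 without fully recovering it, which is typical of quadratic extrapolation. Larger B reduces simulation noise at the cost of computing time.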



Value

the coefficients of the main model after correction.


an adjusted variance-covariance matrix estimated by bootstrap.


the fitted mean values, obtained by transforming the linear predictors by the inverse of the link function.


the vector of values used in the extrapolation step.


a matrix of coefficients before extrapolation. Each row corresponds to a specified lambda, and each column represents a component of the coefficient vector. This output is intended for plotting.


minus twice the maximized log-likelihood.


the deviance of the null model, which includes only the intercept and offset terms. It is comparable to the deviance.


the degrees of freedom.


the degrees of freedom of the null model.


Akaike's Information Criterion, minus twice the maximized log-likelihood plus twice the number of parameters.


Other components are the arguments that have been used in the function.
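The definitions of the deviance and AIC components given above can be checked by hand on an ordinary logistic fit. The sketch uses glm from stats with simulated 0/1 responses, for which the deviance equals minus twice the maximized log-likelihood; it does not call augSIMEX.

```r
set.seed(5)
n <- 150
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.3 + 0.8 * x))
fit <- glm(y ~ x, family = binomial)

p <- length(coef(fit))                      # number of parameters
minus2ll <- -2 * as.numeric(logLik(fit))    # minus twice the log-likelihood

c(deviance = deviance(fit), minus2ll = minus2ll)   # equal for 0/1 responses
c(aic = AIC(fit), by_hand = minus2ll + 2 * p)      # AIC = -2 logLik + 2 p
```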


Author(s)

Qihuang Zhang and Grace Y. Yi


References

Yi, G. Y., Ma, Y., Spiegelman, D., et al. (2015). Functional and structural methods with mixed measurement error and misclassification in covariates. Journal of the American Statistical Association, 110(510), 681-696.

See Also



Examples

### Example 1: Univariate Case
example1<-augSIMEX(mainformula = Y ~ Xstar + Zstar + W, family = binomial(link = logit),
  mismodel = pi|qi ~ W, 
  meformula = Xstar ~ X + Z + W,
  data = ToyUni$Main,validationdata = ToyUni$Validation, subset = NULL,
  err.var = "Xstar", mis.var = "Zstar", err.true = "X", mis.true = "Z", 
  err.mat = NULL,
  lambda = NULL, M = 5, B = 2, nBoot = 2, extrapolation="quadratic")                   

## Without adjustment
example1_naive <- glm(formula = Y ~ Xstar + Zstar + W,
family = binomial(link = logit),data = ToyUni$Main)

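## Using accurate data
## (the data argument is cut off in the source; ToyUni$True is assumed here)
example1_true <- glm(formula = Y ~ Xstar + Zstar + W,
  family = binomial(link = logit), data = ToyUni$True)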

### Example 2: Multivariate Case
ErrorFormula <- Xstar.X1|Xstar.X2 ~ -1+X.X1 | -1+X.X2   ## measurement error model
example2 <- augSIMEX(mainformula = Y ~ Xstar.X1 + Xstar.X2 + Zstar + W.W1 + W.W2,
  mismodel = pi|qi ~ X.X1 + X.X2 + W.W1 + W.W2, family = binomial,
  meformula = ErrorFormula,   ## pass the measurement error model defined above
  data = ToyMult$Main, validationdata = ToyMult$Validation, subset = NULL,
  err.var = c("Xstar.X1","Xstar.X2"), mis.var = "Zstar", err.true = c("X.X1","X.X2"),
  mis.true = "Z", err.mat = NULL,
  lambda = NULL, M = 5, B = 2, nBoot = 2, extrapolation = "quadratic")

### Example 3: Repeated Measurements
example3<-augSIMEX(mainformula = Y~Xstar1+Zstar+W, family = binomial(link=logit),
  mismodel = pi|qi ~ W, meformula = Xstar ~ X + Z + W,
  subset=NULL, err.var="Xstar1", mis.var="Zstar", err.true="X", mis.true="Z", 
  err.mat = NULL, repeated = TRUE,repind=list(Xstar1=c("Xstar1","Xstar2")),
  lambda=NULL, M=5, B=2, nBoot=2, extrapolation="quadratic")

[Package augSIMEX version 3.7.4 Index]