Sieve maximum likelihood estimator (SMLE) for two-phase linear regression problems


Performs efficient semiparametric estimation for general two-phase measurement error models when there are errors in both the outcome and covariates.


linear2ph(
  Y_unval = NULL,
  Y = NULL,
  X_unval = NULL,
  X = NULL,
  Z = NULL,
  Bspline = NULL,
  data = NULL,
  hn_scale = 1,
  noSE = FALSE,
  TOL = 1e-04,
  MAX_ITER = 1000,
  verbose = FALSE
)



Column name of the error-prone or unvalidated continuous outcome. Subjects with missing values of Y_unval are omitted from the analysis. This argument is required.


Column name that stores the validated value of Y_unval in the second phase. Subjects with missing values of Y are considered as those not selected in the second phase. This argument is required.


Specifies the columns of the error-prone covariates. Subjects with missing values of X_unval are omitted from the analysis. This argument is required.


Specifies the columns that store the validated values of X_unval in the second phase. Subjects with missing values of X are considered as those not selected in the second phase. This argument is required.


Specifies the columns of the accurately measured covariates. Subjects with missing values of Z are omitted from the analysis. This argument is optional.


Specifies the columns of the B-spline basis. Subjects with missing values of Bspline are omitted from the analysis. This argument is required.


Specifies the name of the dataset. This argument is required.


Specifies the scale of the perturbation constant in the variance estimation. For example, if hn_scale = 0.5, then the perturbation constant is 0.5n^{-1/2}, where n is the first-phase sample size. The default value is 1. This argument is optional.


If TRUE, then the variances of the parameter estimators will not be estimated. The default value is FALSE. This argument is optional.


Specifies the convergence criterion in the EM algorithm. The default value is 1e-04. This argument is optional.


Maximum number of iterations in the EM algorithm. The default number is 1000. This argument is optional.


If TRUE, then show details of the analysis. The default value is FALSE. This argument is optional.
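Two of the conventions described in the arguments can be checked by hand before fitting: second-phase membership is encoded by missing values in the validated columns, and the perturbation constant is hn_scale times n^{-1/2}. The following is a minimal sketch of both, not code from the package. The data frame toy and the helper perturbation_constant() are hypothetical names introduced here for illustration, and treating a subject as validated only when both Y and X are observed is an assumption of this sketch.

```r
# Toy two-phase data (made-up values): validated columns are NA
# for subjects not selected in the second phase.
toy <- data.frame(
  Y_tilde = c(1.2, 0.4, -0.3, 2.1, 0.9),
  X_tilde = c(0.5, -1.0, 0.2, 1.7, 0.0),
  Y       = c(1.0, NA, NA, 2.0, NA),
  X       = c(0.4, NA, NA, 1.5, NA)
)

# Assumption of this sketch: a subject is in the second phase
# when both validated values are observed.
in_phase2 <- !is.na(toy$Y) & !is.na(toy$X)

n  <- nrow(toy)        # first-phase sample size
n2 <- sum(in_phase2)   # second-phase sample size

# Hypothetical helper (not part of the package): perturbation constant
# hn_scale * n^(-1/2), with n the first-phase sample size.
perturbation_constant <- function(hn_scale, n) {
  hn_scale * n^(-1/2)
}

hn_half    <- perturbation_constant(0.5, 100)  # 0.5 / sqrt(100)
hn_default <- perturbation_constant(1, 100)    # 1 / sqrt(100)
```

With a first-phase sample size of 100, hn_scale = 0.5 gives a constant of 0.05 and the default hn_scale = 1 gives 0.1.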



Stores the analysis results.


Stores the residual standard error.


Stores the covariance matrix of the regression coefficient estimates.


In parameter estimation, if the EM algorithm converges, then converge = TRUE. Otherwise, converge = FALSE.


In variance estimation, if the EM algorithm converges, then converge_cov = TRUE. Otherwise, converge_cov = FALSE.
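The two convergence flags are worth checking before any estimate is interpreted. The sketch below shows one way to do that. check_convergence() and mock_fit are hypothetical names, not part of the package, and the list only mimics the two flags named above. What converge_cov contains when noSE = TRUE is not stated here, so the helper simply treats anything other than TRUE as "not converged".

```r
# Hypothetical helper (not part of the package): summarise the two
# convergence flags of a fitted object `fit`.
check_convergence <- function(fit) {
  if (!isTRUE(fit$converge)) {
    return("EM algorithm did not converge in parameter estimation")
  }
  if (!isTRUE(fit$converge_cov)) {
    return("parameter estimation converged, but variance estimation did not")
  }
  "both parameter and variance estimation converged"
}

# Stand-in list carrying only the two flags; in practice the object
# returned by linear2ph() would be passed instead.
mock_fit <- list(converge = TRUE, converge_cov = FALSE)
status <- check_convergence(mock_fit)
```

If the first flag is FALSE, raising MAX_ITER or loosening TOL are the levers the arguments expose. Which one is appropriate depends on the data.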


Tao, R., Mercaldo, N. D., Haneuse, S., Maronge, J. M., Rathouz, P. J., Heagerty, P. J., & Schildcrout, J. S. (2021). Two-wave two-phase outcome-dependent sampling designs, with applications to longitudinal binary data. Statistics in Medicine, 40(8), 1863–1876.

See Also

cv_linear2ph(), which calculates the average predicted log-likelihood for this function.


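 rho = -.3      # correlation between the errors in Y and X
 p = 0.3        # probability that a subject's measurements are error-contaminated
 hn_scale = 1   # not used below: the call passes hn_scale=0.1 directly
 nsieve = 20    # number of B-spline basis functions

 n = 100        # first-phase sample size
 n2 = 40        # second-phase (validation) sample size
 alpha = 0.3    # true intercept
 beta = 0.4     # true slope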

 ### generate data
 simX = rnorm(n)
 epsilon = rnorm(n)
 simY = alpha+beta*simX+epsilon
 error = MASS::mvrnorm(n, mu=c(0,0), Sigma=matrix(c(1, rho, rho, 1), nrow=2))
 simS = rbinom(n, 1, p)
 simU = simS*error[,2]
 simW = simS*error[,1]
 simY_tilde = simY+simW
 simX_tilde = simX+simU
 id_phase2 = sample(n, n2)
 simY[-id_phase2] = NA
 simX[-id_phase2] = NA
 # # histogram basis
 # Bspline = matrix(NA, nrow=n, ncol=nsieve)
 # cut_x_tilde = cut(simX_tilde, breaks=quantile(simX_tilde, probs=seq(0, 1, 1/nsieve)), 
 #   include.lowest = TRUE)
 # for (i in 1:nsieve) {
 #     Bspline[,i] = as.numeric(cut_x_tilde == names(table(cut_x_tilde))[i])
 # }
 # colnames(Bspline) = paste("bs", 1:nsieve, sep="")
 # # histogram basis
 # # linear basis
 # Bspline = splines::bs(simX_tilde, df=nsieve, degree=1,
 #   Boundary.knots=range(simX_tilde), intercept=TRUE)
 # colnames(Bspline) = paste("bs", 1:nsieve, sep="")
 # # linear basis
 # # quadratic basis
 # Bspline = splines::bs(simX_tilde, df=nsieve, degree=2, 
 #   Boundary.knots=range(simX_tilde), intercept=TRUE)
 # colnames(Bspline) = paste("bs", 1:nsieve, sep="")
 # # quadratic basis
 # cubic basis
 Bspline = splines::bs(simX_tilde, df=nsieve, degree=3, 
   Boundary.knots=range(simX_tilde), intercept=TRUE)
 colnames(Bspline) = paste("bs", 1:nsieve, sep="")
 # cubic basis
 data = data.frame(Y_tilde=simY_tilde, X_tilde=simX_tilde, Y=simY, X=simX, Bspline)

 res = linear2ph(Y="Y", X="X", Y_unval="Y_tilde", X_unval="X_tilde", 
   Bspline=colnames(Bspline), data=data, hn_scale=0.1)
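Since subjects with missing basis values are omitted, it can be useful to inspect a basis before passing its column names to Bspline. The sketch below rebuilds a cubic basis the same way as the example, on its own simulated covariate, and checks its shape. The names x and B and the seed are illustrative choices, and nothing here calls linear2ph().

```r
set.seed(1)      # arbitrary seed, for reproducibility only
nsieve <- 20     # number of basis functions, as in the example
x <- rnorm(100)  # stand-in for the error-prone covariate

# Cubic B-spline basis with an intercept, spanning the observed range of x.
B <- splines::bs(x, df = nsieve, degree = 3,
                 Boundary.knots = range(x), intercept = TRUE)
colnames(B) <- paste("bs", 1:nsieve, sep = "")

basis_dim <- dim(B)        # one row per subject, nsieve columns
row_totals <- rowSums(B)   # with intercept = TRUE, each row sums to 1
has_missing <- anyNA(B)    # TRUE would mean some subjects get dropped
```

Rows summing to one reflect the partition-of-unity property of a B-spline basis that includes the intercept. A row that violates it usually points to a covariate value outside the boundary knots.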

