R: Adjustment Method

fit_mixture {pldamixture}

R Documentation

Adjustment Method

Description

Perform regression adjusted for mismatched data. The function currently supports Cox Proportional Hazards Regression (right-censored data only) and Generalized Linear Regression Models (Gaussian, Gamma, Poisson, and Logistic (binary models only)). Information about the underlying record linkage process can be incorporated into the method if available (e.g., assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches).
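
To illustrate the problem this adjustment targets, the following base-R sketch simulates linked data in which a share of responses is attached to the wrong record and compares a naive least-squares fit with the fit on correctly linked pairs. The sample size, the 30% mismatch rate, and all variable names are assumptions chosen for illustration; nothing here calls `fit_mixture` itself.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)            # correctly linked (x, y) pairs

# Hypothetical linkage error: about 30% of responses are shuffled among themselves
mismatch <- runif(n) < 0.3
idx <- which(mismatch)
y_linked <- y
y_linked[idx] <- y[sample(idx)]

fit_clean <- lm(y ~ x)               # fit on correctly linked data
fit_naive <- lm(y_linked ~ x)        # fit that ignores the mismatches

# The naive slope is typically pulled toward zero relative to the clean slope
round(c(clean = unname(coef(fit_clean)[2]),
        naive = unname(coef(fit_naive)[2])), 2)
```

In this simulated setting the shuffled responses carry no information about `x`, which is why the unadjusted slope tends to shrink.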

Usage

fit_mixture(
  formula,
  data,
  family = "gaussian",
  mformula,
  safematches,
  mrate,
  control = list(initbeta = "default", initgamma = "default", fy = "default", maxiter =
    1000, tol = 1e-04, cmaxiter = 1000),
  ...
)

Arguments

`formula`	a formula object for the outcome model, with the covariate(s) on the right of "~" and the response on the left. In the Cox proportional hazards setting, the response should be provided using the `Surv` function and the covariates should be separated by + signs.
`data`	a data.frame with linked data used in "formula" and "mformula" (optional)
`family`	the type of regression model ("gaussian" - default, "poisson", "binomial", "gamma", "cox"). For Generalized Linear Models, standard link functions are used ("identity" for Gaussian, "log" for Poisson and Gamma, and "logit" for binomial).
`mformula`	a one-sided formula object for the mismatch indicator model, with the covariates on the right of "~". The default is an intercept-only model corresponding to a constant mismatch rate.
`safematches`	an indicator variable for safe matches (TRUE: the record can be treated as a correct match; FALSE: the record may be mismatched). The default is FALSE for all matches.
`mrate`	the assumed overall mismatch rate (a proportion between 0 and 1). If not provided, no overall mismatch rate is assumed.
`control`	an optional list variable to customize the initial parameter estimates ("initbeta" for the outcome model and "initgamma" for the mismatch indicator model), estimated marginal density of the response ("fy"), maximum iterations for the EM algorithm ("maxiter"), maximum iterations for the subroutine in the constrained logistic regression function ("cmaxiter"), and convergence tolerance for the termination of the EM algorithm ("tol").
`...`	the option to directly pass "control" arguments
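
As a hedged sketch of how the "control" list might be customized, the call below tightens the EM convergence tolerance and lowers the iteration cap while leaving the other entries at the defaults shown in Usage. The values 500 and 1e-06 are arbitrary illustrations, the full list is spelled out because it is not stated here whether a partial list is accepted, and the call is guarded so that it only runs when the package is installed.

```r
if (requireNamespace("pldamixture", quietly = TRUE)) {
  library(pldamixture)

  # Same model as in the Examples section, with a customized control list
  fit_strict <- fit_mixture(
    age_at_death ~ poly(unit_yob, 3, raw = TRUE),
    data = lifem,
    family = "gaussian",
    mformula = ~ commf + comml,
    safematches = lifem$hndlnk == "Hand-Linked At Some Level",
    mrate = 0.05,
    control = list(initbeta = "default", initgamma = "default", fy = "default",
                   maxiter = 500,    # illustrative cap on EM iterations
                   tol = 1e-06,      # illustrative, stricter than the default 1e-04
                   cmaxiter = 1000)
  )
}
```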

Value

a list of results; which components are returned depends on the "family" specified.

`coefficients`	the outcome model coefficient estimates
`match.prob`	the correct match probabilities for all observations
`objective`	a variable that tracks the negative log pseudo-likelihood for all iterations of the EM algorithm.
`family`	the type of (outcome) regression model
`standard.errors`	the estimated standard errors
`m.coefficients`	the correct match model coefficient estimates
`call`	the matched call
`wfit`	an internal-use object for the predict function
`dispersion`	the dispersion parameter estimate when the family is a Generalized Linear Model
`Lambdahat_0`	the baseline cumulative hazard (using weighted Breslow estimator) when the family is "cox"
`g_Lambdahat_0`	the baseline cumulative hazard for the marginal density of the response variable (using Nelson-Aalen estimator) when the family is "cox"
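
One way to read `match.prob` is as a posterior probability in a two-component mixture, where a correctly matched response follows the outcome model and a mismatched response follows the marginal density of the response. The base-R sketch below computes such probabilities for simulated Gaussian data. It is not the package's internal code: the parameter values are assumed known rather than estimated, and a normal density stands in for the estimated marginal density ("fy").

```r
set.seed(2)
n <- 500
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
is_mismatch <- runif(n) < 0.2          # hypothetical true mismatch status
idx <- which(is_mismatch)
y[idx] <- y[sample(idx)]               # shuffle responses of mismatched records

# Assumed (illustrative) parameter values, not estimates from fit_mixture
beta  <- c(1, 2)                       # outcome model coefficients
sigma <- 0.5                           # residual standard deviation
alpha <- 0.2                           # mismatch rate

f_match    <- dnorm(y, mean = beta[1] + beta[2] * x, sd = sigma)  # f(y | x)
f_marginal <- dnorm(y, mean = mean(y), sd = sd(y))                # stand-in for f(y)

# Posterior probability that each record is a correct match
match_prob <- (1 - alpha) * f_match /
  ((1 - alpha) * f_match + alpha * f_marginal)

# Average probability by true status (FALSE = correctly matched)
tapply(match_prob, is_mismatch, mean)
```

Records whose response is poorly explained by the outcome model receive a lower probability, which is the sense in which such weights can down-weight suspected mismatches.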

Note

The references below discuss the implemented framework in more detail. The standard errors are estimated using Louis' method for the "cox" family (Bukke et al., 2023) and using the sandwich formula otherwise (Slawski et al., 2023).
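
Given coefficient estimates and standard errors, Wald-type z statistics and confidence intervals can be formed in the usual way. The helper below, `wald_table`, is a hypothetical convenience function written for this illustration and is not part of the package; the numbers passed to it are toy values rather than output from `fit_mixture`.

```r
# Hypothetical helper: Wald z statistics and normal-approximation intervals
wald_table <- function(est, se, level = 0.95) {
  stopifnot(length(est) == length(se))
  z_crit <- qnorm(1 - (1 - level) / 2)
  data.frame(
    estimate  = est,
    std.error = se,
    z         = est / se,
    lower     = est - z_crit * se,
    upper     = est + z_crit * se
  )
}

# Toy numbers for illustration only
wald_table(est = c(a = 1.0, b = -0.5), se = c(0.2, 0.25))
```

With a fitted object, a call such as `wald_table(fit$coefficients, fit$standard.errors)` would apply only if the two components have the same length and ordering, which should be checked first.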

*Corresponding Author (mslawsk3@gmu.edu)

References

Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023). A General Framework for Regression with Mismatched Data Based on Mixture Modeling. Under Review. <doi:10.48550/arXiv.2306.00909>

Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023). Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling. Under Review.

Slawski, M.*, Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. <doi:10.1080/10618600.2020.1870482>

Examples

## commonness score of first and last names used for linkage
mformula <- ~ commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05

fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)

[Package pldamixture version 0.1.1 Index]