fit_mixture {pldamixture}    R Documentation

Adjustment Method

Description

Fit a regression model adjusted for mismatched data. The function currently supports Cox proportional hazards regression (right-censored data only) and generalized linear models (Gaussian, Gamma, Poisson, and logistic; the logistic model supports binary outcomes only). Information about the underlying record linkage process can be incorporated if available, e.g., an assumed overall mismatch rate, safe matches, predictors of match status, or predicted probabilities of correct matches.
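To illustrate the problem this function addresses, the following base R sketch simulates linked data in which a fraction of the responses is attached to the wrong record, then compares a naive fit with a fit on the true pairs. The simulation design (30% mismatch rate, linear model, all variable names) is an assumption made for illustration only; the sketch does not reproduce the estimator implemented in fit_mixture.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)   # true outcome model (assumed)

# Assumed mismatch mechanism: about 30% of records receive the response
# of another record, permuted among the mismatched records.
mismatch <- rbinom(n, 1, 0.3) == 1
idx <- which(mismatch)
y_linked <- y
y_linked[idx] <- y[sample(idx)]

naive  <- lm(y_linked ~ x)   # ignores mismatches
oracle <- lm(y ~ x)          # uses the true pairs (unavailable in practice)

# Compare the two slope estimates
c(naive = coef(naive)[["x"]], oracle = coef(oracle)[["x"]])
```

The naive slope is expected to be attenuated toward zero, because mismatched responses carry little information about x. This is the bias that the adjustment targets.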

Usage

fit_mixture(
  formula,
  data,
  family = "gaussian",
  mformula,
  safematches,
  mrate,
  control = list(initbeta = "default", initgamma = "default", fy = "default", maxiter =
    1000, tol = 1e-04, cmaxiter = 1000),
  ...
)

Arguments

formula

a formula object for the outcome model, with the covariate(s) on the right of "~" and the response on the left. In the Cox proportional hazards setting, the response should be provided using the Surv function and the covariates should be separated by + signs.
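As a sketch of the Cox case, the formula below places a Surv response on the left and covariates joined by + on the right. The data frame "linked" and its columns (time, status, age, sex) are hypothetical and serve only to show the shape of the formula; Surv is assumed to come from the survival package.

```r
# Hypothetical linked data; values and column names are illustrative only
linked <- data.frame(
  time   = c(5.2, 3.1, 8.4, 2.7, 6.0),
  status = c(1, 0, 1, 1, 0),            # 1 = event observed, 0 = right-censored
  age    = c(61, 54, 70, 48, 66),
  sex    = factor(c("f", "m", "m", "f", "f"))
)

# Outcome model formula: Surv response on the left, covariates on the right
cox_formula <- Surv(time, status) ~ age + sex
all.vars(cox_formula)

# Inspect the response object if the survival package is available
if (requireNamespace("survival", quietly = TRUE)) {
  print(with(linked, survival::Surv(time, status)))
}

# The formula would then be passed along the lines of:
# fit_mixture(cox_formula, data = linked, family = "cox", ...)
```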

data

a data.frame with the linked data used in "formula" and "mformula" (optional)

family

the type of regression model ("gaussian" - default, "poisson", "binomial", "gamma", "cox"). For Generalized Linear Models, standard link functions are used ("identity" for Gaussian, "log" for Poisson and Gamma, and "logit" for binomial).
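For comparison with base R, the link functions listed above correspond to the following stats family objects. This is only an analogy with glm() conventions, not a description of how fit_mixture is implemented internally. Note that the log link for Gamma differs from the default of Gamma() in base R.

```r
# Base R family objects using the links stated in this help page
links <- list(
  gaussian = gaussian(link = "identity"),
  poisson  = poisson(link = "log"),
  gamma    = Gamma(link = "log"),
  binomial = binomial(link = "logit")
)
sapply(links, function(f) f$link)

# Base R's default link for Gamma() is "inverse", not "log"
Gamma()$link
```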

mformula

a one-sided formula object for the mismatch indicator model, with the covariates on the right of "~". The default is an intercept-only model corresponding to a constant mismatch rate.

safematches

an indicator variable for safe matches (TRUE: the record can be treated as a correct match; FALSE: the record may be mismatched). The default is FALSE for all matches.

mrate

the assumed overall mismatch rate (a proportion between 0 and 1). If not provided, no overall mismatch rate is assumed.

control

an optional list variable to customize the initial parameter estimates ("initbeta" for the outcome model and "initgamma" for the mismatch indicator model), estimated marginal density of the response ("fy"), maximum iterations for the EM algorithm ("maxiter"), maximum iterations for the subroutine in the constrained logistic regression function ("cmaxiter"), and convergence tolerance for the termination of the EM algorithm ("tol").
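A minimal sketch of a customized "control" list follows. It spells out all six entries shown in Usage, since whether a partially specified list is completed with defaults is not stated here. The values 200 and 1e-06 are arbitrary choices for illustration. The call itself reuses the lifem example from this page and runs only if the package is installed.

```r
# Full control list; only maxiter and tol differ from the documented defaults
ctrl <- list(initbeta = "default", initgamma = "default", fy = "default",
             maxiter = 200, tol = 1e-06, cmaxiter = 1000)

if (requireNamespace("pldamixture", quietly = TRUE)) {
  library(pldamixture)
  # Hand-linked records are treated as safe matches, as in the Examples
  safe <- lifem$hndlnk == "Hand-Linked At Some Level"
  fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE),
                     data = lifem, family = "gaussian",
                     mformula = ~ commf + comml,
                     safematches = safe, mrate = 0.05,
                     control = ctrl)
}
```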

...

additional arguments, allowing "control" options to be passed directly

Value

a list of results; the components returned depend on the "family" specified.

coefficients

the outcome model coefficient estimates

match.prob

the correct match probabilities for all observations

objective

a variable that tracks the negative log pseudo-likelihood for all iterations of the EM algorithm.

family

the type of (outcome) regression model

standard.errors

the estimated standard errors

m.coefficients

the correct match model coefficient estimates

call

the matched call

wfit

an internal-use object for the predict function

dispersion

the dispersion parameter estimate when the family is a Generalized Linear Model

Lambdahat_0

the baseline cumulative hazard (using weighted Breslow estimator) when the family is "cox"

g_Lambdahat_0

the baseline cumulative hazard for the marginal density of the response variable (using Nelson-Aalen estimator) when the family is "cox"
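The components above can be inspected with ordinary list access. The helper below is hypothetical (it is not part of pldamixture) and assumes that match.prob is a numeric vector with one entry per observation and that objective is a numeric vector with one entry per EM iteration.

```r
# Hypothetical helper: indices of observations whose estimated probability
# of being a correct match falls below a chosen threshold
flag_likely_mismatches <- function(fit, threshold = 0.5) {
  stopifnot(is.list(fit), is.numeric(fit$match.prob))
  which(fit$match.prob < threshold)
}

# Hypothetical helper: change in the negative log pseudo-likelihood
# between consecutive EM iterations
objective_changes <- function(fit) {
  stopifnot(is.list(fit), is.numeric(fit$objective))
  diff(fit$objective)
}

# Usage with a fitted object named fit:
# flag_likely_mismatches(fit, threshold = 0.5)
# objective_changes(fit)
```

The threshold of 0.5 is an arbitrary default; a suitable cutoff depends on the application.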

Note

The references below discuss the implemented framework in more detail. The standard errors are estimated using Louis' method for the "cox" family (Bukke et al., 2023) and using the sandwich formula otherwise (Slawski et al., 2023).
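Given coefficient estimates and standard errors, Wald-type confidence intervals can be formed in the usual way. The helper below is a hypothetical sketch, not a package function; it assumes that the standard.errors component is aligned element by element with coefficients, and it stops if the lengths differ.

```r
# Hypothetical helper: normal-approximation (Wald) confidence intervals
wald_ci <- function(fit, level = 0.95) {
  est <- fit$coefficients
  se  <- fit$standard.errors
  # Assumption: one standard error per outcome model coefficient
  stopifnot(is.numeric(est), is.numeric(se), length(est) == length(se))
  z <- qnorm(1 - (1 - level) / 2)
  cbind(estimate = est, lower = est - z * se, upper = est + z * se)
}

# Usage with a fitted object named fit:
# wald_ci(fit, level = 0.95)
```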

*Corresponding Author (mslawsk3@gmu.edu)

References

Slawski, M.*, West, B. T., Bukke, P., Diao, G., Wang, Z., & Ben-David, E. (2023). A General Framework for Regression with Mismatched Data Based on Mixture Modeling. Under Review. <doi:10.48550/arXiv.2306.00909>

Bukke, P., Ben-David, E., Diao, G., Slawski, M.*, & West, B. T. (2023). Cox Proportional Hazards Regression Using Linked Data: An Approach Based on Mixture Modelling. Under Review.

Slawski, M.*, Diao, G., & Ben-David, E. (2021). A pseudo-likelihood approach to linear regression with partially shuffled data. Journal of Computational and Graphical Statistics, 30(4), 991-1003. <doi:10.1080/10618600.2020.1870482>

Examples

## commonness score of first and last names used for linkage
mformula <- ~commf + comml
## hand-linked records are considered "safe" matches
safematches <- lifem$hndlnk == "Hand-Linked At Some Level"
## overall mismatch rate in the data set is assumed to be ~ 0.05
mrate <- 0.05

fit <- fit_mixture(age_at_death ~ poly(unit_yob, 3, raw = TRUE), data = lifem,
                   family = "gaussian", mformula = mformula,
                   safematches = safematches, mrate = mrate)


[Package pldamixture version 0.1.1 Index]