mixed.mtc {StatMatch} | R Documentation
Statistical Matching via Mixed Methods
Description
This function implements some mixed methods to perform statistical matching between two data sources.
Usage
mixed.mtc(data.rec, data.don, match.vars, y.rec, z.don, method="ML",
          rho.yz=NULL, micro=FALSE, constr.alg="Hungarian")
Arguments
data.rec |
A matrix or data frame that plays the role of recipient in the statistical matching application. This data set must contain all variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments |
data.don |
A matrix or data frame that plays the role of donor in the statistical matching application. This data set must contain all the numeric variables (columns) that should be used in statistical matching, i.e. the variables called by the arguments |
match.vars |
A character vector with the names of the common variables (the columns in both the data frames) to be used as matching variables (X). |
y.rec |
A character vector with the name of the target variable Y that is observed only for units in |
z.don |
A character vector with the name of the target variable Z that is observed only for units in |
method |
A character vector that identifies the method that should be used to estimate the parameters of the regression models: Y vs. X and Z vs. X. Maximum Likelihood method is used when |
rho.yz |
A numeric value representing a guess for the correlation between the Y ( By default ( |
micro |
Logical. When |
constr.alg |
A string that has to be specified when |
Details
This function implements some mixed methods to perform statistical matching. A mixed method consists of two steps:
(1) adoption of a parametric model for the joint distribution of $\left( \mathbf{X},Y,Z \right)$ and estimation of its parameters;
(2) derivation of a complete “synthetic” data set (recipient data set filled in with values for the Z variable) using a nonparametric approach.
In this case, as far as (1) is concerned, it is assumed that $\left( \mathbf{X},Y,Z \right)$ follows a multivariate normal distribution. Please note that if some of the X are categorical, then they are recoded into dummies before starting with the estimation. In such a case, the assumption of multivariate normal distribution may be questionable.
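The recoding of categorical X variables into dummies can be illustrated with base R. This is only a sketch of the general idea: it uses model.matrix() with its default treatment contrasts, which is an assumption and not necessarily the coding that mixed.mtc applies internally. The object names x and x.num are hypothetical.

```r
# Hypothetical illustration: expand a factor into 0/1 dummy columns.
x <- iris[, c("Petal.Length", "Petal.Width", "Species")]

# model.matrix() uses treatment contrasts by default, so a factor with
# k levels becomes k - 1 dummy columns; the intercept column is dropped.
x.num <- model.matrix(~ ., data = x)[, -1, drop = FALSE]

colnames(x.num)  # numeric columns plus one dummy per non-reference level
dim(x.num)
```

The dummy columns take only the values 0 and 1, which is why the multivariate normal assumption becomes questionable once they enter the model.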
The whole procedure is based on the imputation method known as predictive mean matching. The procedure consists of three steps:
step 1a) Regression step: the two linear regression models Y vs. X and Z vs. X are considered and their parameters are estimated.
step 1b) Computation of intermediate values. For the units in data.rec the following intermediate values are derived:
$$\tilde{z}_{a} = \hat{\alpha}_{Z} + \hat{\beta}_{Z\mathbf{X}} \mathbf{x}_a + e_a$$
for each $a=1,\ldots,n_{A}$, where $n_A$ is the number of units in data.rec (rows of data.rec). Note that $e_a$ is a random draw from a normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Z|\mathbf{X}}$.
Similarly, for the units in data.don the following intermediate values are derived:
$$\tilde{y}_{b} = \hat{\alpha}_{Y} + \hat{\beta}_{Y\mathbf{X}} \mathbf{x}_b + e_b$$
for each $b=1,\ldots,n_{B}$, where $n_B$ is the number of units in data.don (rows of data.don). Here $e_b$ is a random draw from a normal distribution with zero mean and estimated residual variance $\hat{\sigma}_{Y|\mathbf{X}}$.
step 2) Matching step. For each observation (row) in data.rec a donor is chosen in data.don through a nearest neighbor constrained distance hot deck procedure. The distances are computed between $\left( y_a, \tilde{z}_a \right)$ and $\left( \tilde{y}_b, z_b \right)$ using Mahalanobis distance.
For further details see Sections 2.5.1 and 3.6.1 in D'Orazio et al. (2006).
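The three steps can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the code used by mixed.mtc: the regressions are fitted by lm() on the file where each target is observed (corresponding to the conditional independence case), the Mahalanobis distance uses the covariance of the stacked pairs, and the matching is an unconstrained nearest neighbor search in place of the constrained procedure. All object names (rec, don, z.tilde, y.tilde, don.idx, filled) are hypothetical.

```r
set.seed(1)
pos <- sample(1:150, 50)
rec <- iris[pos,  c("Sepal.Length", "Petal.Length", "Petal.Width")]  # Y and X
don <- iris[-pos, c("Sepal.Width",  "Petal.Length", "Petal.Width")]  # Z and X

# Step 1a: regressions Y vs. X (recipient file) and Z vs. X (donor file)
fit.y <- lm(Sepal.Length ~ Petal.Length + Petal.Width, data = rec)
fit.z <- lm(Sepal.Width  ~ Petal.Length + Petal.Width, data = don)

# Step 1b: intermediate values = predicted mean + random normal residual
z.tilde <- predict(fit.z, newdata = rec) + rnorm(nrow(rec), 0, sigma(fit.z))
y.tilde <- predict(fit.y, newdata = don) + rnorm(nrow(don), 0, sigma(fit.y))

# Step 2 (simplified): unconstrained nearest neighbor on (y, z) pairs
pts.rec <- cbind(y = rec$Sepal.Length, z = z.tilde)  # (y_a, z~_a)
pts.don <- cbind(y = y.tilde, z = don$Sepal.Width)   # (y~_b, z_b)
S <- cov(rbind(pts.rec, pts.don))  # assumption: covariance of stacked pairs

# Squared Mahalanobis distances: one row per recipient, one column per donor
d2 <- t(apply(pts.rec, 1, function(p) mahalanobis(pts.don, center = p, cov = S)))
don.idx <- apply(d2, 1, which.min)

# Recipient file filled in with the observed Z of the selected donors
filled <- cbind(rec, Sepal.Width = don$Sepal.Width[don.idx])
head(filled)
```

Because the search is unconstrained, the same donor may be selected for several recipients. A constrained version would instead solve an assignment problem on the distance matrix d2 so that each donor is used at most once.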
In step 1a) the parameters of the regression model can be estimated by means of the Maximum Likelihood method (method="ML"; see D'Orazio et al., 2006, pp. 19–23, 73–75) or using the Moriarity and Scheuren (2001 and 2003) approach (method="MS"; see also D'Orazio et al., 2006, pp. 75–76). The two estimation methods are compared in D'Orazio et al. (2005).
When method="MS", if the value specified for the argument rho.yz is not compatible with the other correlation coefficients estimated from the data, then it is substituted with the closest value compatible with the other estimated coefficients.
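To see what "compatible" means here, recall that the correlation between Y and Z can be written as $\rho_{YZ} = \rho_{Y\mathbf{X}} R_{\mathbf{XX}}^{-1} \rho_{\mathbf{X}Z} + \rho_{YZ|\mathbf{X}} \sqrt{(1 - \rho_{Y\mathbf{X}} R_{\mathbf{XX}}^{-1} \rho_{\mathbf{X}Y})(1 - \rho_{Z\mathbf{X}} R_{\mathbf{XX}}^{-1} \rho_{\mathbf{X}Z})}$, so letting the partial correlation $\rho_{YZ|\mathbf{X}}$ range over [-1, 1] gives an admissible interval for $\rho_{YZ}$. The sketch below computes that interval and moves an incompatible guess to the nearest bound. The helper names rho.yz.bounds and clamp.rho are hypothetical, and the exact substitution rule used by mixed.mtc may differ (for instance, it may pick a value slightly inside the interval).

```r
# Hypothetical helper: admissible interval for rho_YZ given the
# correlation matrix of X (R.xx) and the correlations of X with Y and Z.
rho.yz.bounds <- function(R.xx, r.xy, r.xz) {
  R.inv <- solve(R.xx)
  mid  <- drop(t(r.xy) %*% R.inv %*% r.xz)  # value under rho_YZ|X = 0
  half <- sqrt((1 - drop(t(r.xy) %*% R.inv %*% r.xy)) *
               (1 - drop(t(r.xz) %*% R.inv %*% r.xz)))
  c(lower = mid - half, upper = mid + half)
}

# Hypothetical helper: replace an incompatible guess with the nearest bound.
clamp.rho <- function(rho, bounds) {
  min(max(rho, bounds[["lower"]]), bounds[["upper"]])
}

# Toy case with a single X: cor(X, Y) = 0.8 and cor(X, Z) = 0.6
b <- rho.yz.bounds(matrix(1), 0.8, 0.6)
b
clamp.rho(-0.15, b)  # outside the interval, moved to the lower bound
clamp.rho(0.50, b)   # inside the interval, left unchanged
```

The midpoint of the interval is the value of $\rho_{YZ}$ implied by conditional independence of Y and Z given X.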
When micro=FALSE only the estimation of the parameters is performed (step 1a). Otherwise (micro=TRUE), the whole procedure is carried out.
Value
A list with a varying number of components depending on the values of the arguments method and rho.yz.
mu |
The estimated mean vector. |
vc |
The estimated variance–covariance matrix. |
cor |
The estimated correlation matrix. |
res.var |
A vector with estimates of the residual variances |
start.prho.yz |
It is the initial guess for the partial correlation coefficient |
rho.yz |
Returned in output only when |
phi |
When |
filled.rec |
The |
mtc.ids |
when |
dist.rd |
A vector with the distances between each recipient unit and the corresponding donor, returned only in case |
call |
How the function has been called. |
Author(s)
Marcello D'Orazio mdo.statmatch@gmail.com
References
D'Orazio, M., Di Zio, M. and Scanu, M. (2005). “A comparison among different estimators of regression parameters on statistically matched files through an extensive simulation study”, Contributi, 2005/10, Istituto Nazionale di Statistica, Rome.
D'Orazio, M., Di Zio, M. and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester.
Hornik, K. (2012). clue: Cluster ensembles. R package version 0.3-45. https://CRAN.R-project.org/package=clue.
Moriarity, C. and Scheuren, F. (2001). “Statistical matching: a paradigm for assessing the uncertainty in the procedure”, Journal of Official Statistics, 17, 407–422.
Moriarity, C. and Scheuren, F. (2003). “A note on Rubin's statistical matching using file concatenation with adjusted weights and multiple imputation”, Journal of Business and Economic Statistics, 21, 65–73.
See Also
Examples
# load the package that provides mixed.mtc()
library(StatMatch)
# reproduce the statistical matching framework
# starting from the iris data.frame
suppressWarnings(RNGversion("3.5.0"))
set.seed(98765)
pos <- sample(1:150, 50, replace=FALSE)
ir.A <- iris[pos,c(1,3:5)]
ir.B <- iris[-pos, 2:5]
xx <- intersect(colnames(ir.A), colnames(ir.B))
xx # common variables
# ML estimation method under CIA (rho_YZ|X=0);
# only parameter estimates (micro=FALSE)
# only continuous matching variables
xx.mtc <- c("Petal.Length", "Petal.Width")
mtc.1 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width")
# estimated correlation matrix
mtc.1$cor
# ML estimation method under CIA (rho_YZ|X=0);
# only parameter estimates (micro=FALSE)
# categorical variable 'Species' used as matching variable
xx.mtc <- xx
mtc.2 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width")
# estimated correlation matrix
mtc.2$cor
# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# only parameter estimates (micro=FALSE)
mtc.3 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   rho.yz=0.5)
# estimated correlation matrix
mtc.3$cor
# ML estimation method with partial correlation coefficient
# set equal to 0.5 (rho_YZ|X=0.5)
# with imputation step (micro=TRUE)
mtc.4 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   rho.yz=0.5, micro=TRUE, constr.alg="Hungarian")
# first rows of data.rec filled in with z
head(mtc.4$filled.rec)
# Moriarity and Scheuren estimation method under CIA;
# only parameter estimates (micro=FALSE)
mtc.5 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   method="MS")
# the starting value of rho.yz and the value used
# in computations
mtc.5$rho.yz
# estimated correlation matrix
mtc.5$cor
# Moriarity and Scheuren estimation method
# with correlation coefficient set equal to -0.15 (rho_YZ=-0.15)
# with imputation step (micro=TRUE)
mtc.6 <- mixed.mtc(data.rec=ir.A, data.don=ir.B, match.vars=xx.mtc,
                   y.rec="Sepal.Length", z.don="Sepal.Width",
                   method="MS", rho.yz=-0.15,
                   micro=TRUE, constr.alg="lpSolve")
# the starting value of rho.yz and the value used
# in computations
mtc.6$rho.yz
# estimated correlation matrix
mtc.6$cor
# first rows of data.rec filled in with z imputed values
head(mtc.6$filled.rec)