cem {cem} | R Documentation |
Coarsened Exact Matching
Description
Implementation of Coarsened Exact Matching
Usage
cem(treatment=NULL, data = NULL, datalist=NULL, cutpoints = NULL,
grouping = NULL, drop=NULL, eval.imbalance = FALSE, k2k=FALSE,
method=NULL, mpower=2, L1.breaks = NULL, L1.grouping = NULL,
verbose = 0, baseline.group="1",keep.all=FALSE)
Arguments
treatment |
character, name of the treatment variable |
baseline.group |
character, name of the baseline level of the treatment. See Details. |
data |
a data.frame |
datalist |
an optional list of multiply imputed data.frames |
cutpoints |
named list, each element describing the cutpoints for a numerical variable (the names are variable names). Each list element is either a vector of cutpoints, a number of cutpoints, or a method for automatic bin construction. See Details. |
grouping |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |
drop |
a vector of variable names in the data frame to ignore during matching |
eval.imbalance |
Boolean. See Details. |
k2k |
boolean, restrict to k-to-k matching? Default = FALSE. |
method |
distance method to use in k2k matching. See Details. |
mpower |
power of the Minkowski distance. See Details. |
L1.breaks |
list of cutpoints for the calculation of the L1 measure. |
L1.grouping |
as grouping, but for the calculation of the L1 measure. |
verbose |
controls level of verbosity. Default=0. |
keep.all |
if |
Details
For multilevel (and binary) treatment variables, the cem weights are
calculated with respect to the baseline. Therefore, matched units with
treatment variable equal to the baseline level receive weight 1, and the
others receive the usual cem weights. Unless otherwise specified, the
baseline is set to "1". If this level is not one of the values taken by the
treatment variable, the baseline is set to the first level of the treatment
variable.
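The fallback rule above can be sketched in base R. The helper pick_baseline() is hypothetical and written only for illustration; it is not part of cem, and it assumes that "first level" means the first level of factor(treatment).

```r
# Hypothetical helper, not part of the cem package: mimics the documented
# fallback for baseline.group (assumes the levels are those of factor(treatment)).
pick_baseline <- function(treatment, baseline.group = "1") {
  lev <- levels(factor(treatment))
  if (baseline.group %in% lev) baseline.group else lev[1]
}

# "1" is one of the levels ("0", "1"), so it is kept as the baseline
b1 <- pick_baseline(c(0, 1, 1, 0))

# "1" is absent; factor() sorts levels alphabetically, so "high" comes first
b2 <- pick_baseline(c("low", "mid", "high"))
```

Under this assumption, a treatment coded with labels other than "1" should either set baseline.group explicitly or rely on the level ordering of the factor.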
When specifying cutpoints, several automatic methods may be chosen, including
“sturges” (Sturges' rule, the default), “fd” (Freedman-Diaconis' rule),
“scott” (Scott's rule) and “ss” (Shimazaki-Shinomoto's rule).
See references for a description of each rule.
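To get a feel for how the rules differ, the sketch below uses the base R functions nclass.Sturges(), nclass.FD() and nclass.scott() from grDevices on a made-up covariate. This is only an illustration: whether cem computes its bins in exactly this way is an assumption, and base R has no counterpart for the “ss” rule. The variable names in mycp are borrowed from the LL data used in the Examples.

```r
set.seed(1)
x <- rnorm(200, mean = 40, sd = 10)   # made-up numeric covariate

# Number of bins suggested by each rule (base R, package grDevices)
n_sturges <- nclass.Sturges(x)
n_fd      <- nclass.FD(x)
n_scott   <- nclass.scott(x)

# The cutpoints argument is a named list; as described above, each element
# may be a rule name, a number of cutpoints, or an explicit vector.
mycp <- list(age = "fd", re74 = 10, re75 = seq(0, 30000, by = 5000))
```

A list such as mycp would then be passed as cem(..., cutpoints = mycp).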
The grouping option is a list where each element is itself a list. For
example, suppose that variable quest1 has the possible levels
"no answer", NA, "negative", "neutral", "positive" and you want to collect
("no answer", NA, "neutral") into a single group; then the grouping argument
should contain list(quest1=list(c("no answer", NA, "neutral"))). Or, if you
have a discrete variable elements with values 1:10 and you want to collect
it into the groups “1:3,NA”, “4”, “5:9”, “10”, specify the following list in
grouping: list(elements=list(c(1:3,NA), 5:9)). Values not defined in the
grouping are left as they are. If cutpoints and groupings are defined for
the same variable, the groupings take precedence and the corresponding
cutpoints are set to NULL.
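The effect of a grouping can be illustrated in base R. The helper apply_grouping() below is hypothetical and is not how cem is implemented; it only shows which values end up in the same group, with ungrouped values left unchanged.

```r
# Hypothetical helper, not part of cem: relabel every value that belongs to a
# group with a common label; values outside all groups are left as they are.
apply_grouping <- function(x, groups) {
  out <- as.character(x)
  for (g in groups) {
    label <- paste(g, collapse = "/")   # paste() turns NA into the string "NA"
    out[x %in% g] <- label              # %in% matches NA against NA
  }
  out
}

quest1  <- c("no answer", NA, "negative", "neutral", "positive")
grouped <- apply_grouping(quest1, list(c("no answer", NA, "neutral")))

# The same structure, in the form expected by the grouping argument
mygrp <- list(quest1 = list(c("no answer", NA, "neutral")))
```

Here "no answer", NA and "neutral" share one label, while "negative" and "positive" keep their own values.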
verbose: a number greater than or equal to 0. The higher the value, the
more information is provided during the execution of the algorithm.
If eval.imbalance = TRUE, cem$imbalance contains the imbalance measure by
absolute difference in means for numerical variables and chi-square distance
for categorical variables. If FALSE (the default), cem$imbalance is set to
NULL. If data contains missing data, the imbalance measures are not
calculated.
If L1.breaks is missing, the default rule to calculate cutpoints is
Scott's rule.
If k2k is set to TRUE, the algorithm returns strata with the same number of
treated and control units per stratum; otherwise all the matched units are
returned (default). When k2k = TRUE, the user can choose a method (one of
'euclidean', 'maximum', 'manhattan', 'canberra', 'binary' and 'minkowski')
for nearest neighbor matching inside each cem stratum. By default method is
set to NULL, which means random matching inside cem strata. For the
Minkowski distance the power can be specified via the argument mpower.
For more information on method != NULL, refer to the dist help page.
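The distance methods listed above are those of stats::dist. The sketch below compares a few of them on two made-up units; treating dist's argument p as the counterpart of mpower is an assumption based on the description above.

```r
# Two made-up units measured on three covariates
x <- rbind(treated = c(1, 2, 3), control = c(2, 4, 7))

d_euc <- dist(x, method = "euclidean")          # sqrt(1^2 + 2^2 + 4^2)
d_man <- dist(x, method = "manhattan")          # 1 + 2 + 4
d_max <- dist(x, method = "maximum")            # max(1, 2, 4)
d_mk2 <- dist(x, method = "minkowski", p = 2)   # power 2 reproduces Euclidean
```

With cem, a call along the lines of cem(treatment="treated", data=LL, drop="re78", k2k=TRUE, method="euclidean") would request nearest neighbor k-to-k matching instead of random matching within strata.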
If k2k is set to TRUE, keep.all is also set to TRUE.
By default, cem treats missing values as distinct categories and matches
observations with missing values in the same variable in the same stratum,
provided that all the remaining (coarsened) covariates match.
If argument data is non-NULL and datalist is NULL, CEM is applied to the
single data set in data.
Argument datalist is a list of (multiply imputed) data frames (i.e., with
missing cell values imputed). If data is NULL, the function cem is applied
independently to each element of the list, resulting in separately matched
data sets with different numbers of treated and control units.
When data and datalist are both non-NULL, each multiply imputed observation
is assigned to the stratum in which it has been matched most frequently. In
this case, the algorithm outputs the same matching solution for each
multiply imputed data set (i.e., an observation, and the number of treated
and control units matched, in one data set has the same meaning in all, and
is the same for all).
Value
Returns an object of class cem.match if only data is not NULL, or an object
of class cem.match.list, which is a list of objects of class cem.match plus
a field called unique, which is true only if data and datalist are not both
NULL.
A cem.match object is a list with the following slots:
call |
the call |
strata |
vector of the stratum number to which each observation belongs, NA if the observation has not been matched |
n.strata |
number of strata generated |
vars |
names of the variables used for the match |
drop |
variables removed from the match |
X |
the coarsened dataset or NULL if |
breaks |
named list of cutpoints, possibly NULL |
treatment |
name of the treatment variable |
groups |
factor, each observation belongs to one group generated by the treatment variable |
n.groups |
number of groups identified by the treatment variable |
group.idx |
named list, index of observations belonging to each group |
group.len |
sizes of groups |
tab |
summary table of matched units by group |
imbalance |
NULL or a vector of imbalances. See Details. |
Author(s)
Stefano Iacus, Gary King, and Giuseppe Porro
References
Iacus, King, Porro (2011) doi:10.1198/jasa.2011.tm09599
Iacus, King, Porro (2012) doi:10.1093/pan/mpr013
Iacus, King, Porro (2019) doi:10.1017/pan.2018.29
Shimazaki, Shinomoto (2007) doi:10.1162/neco.2007.19.6.1503
Examples
data(LL)
todrop <- c("treated","re78")
imbalance(LL$treated, LL, drop=todrop)
# cem match: automatic bin choice
mat <- cem(treatment="treated", data=LL, drop="re78")
mat
# cem match: user-chosen coarsening
re74cut <- hist(LL$re74, breaks=seq(0,max(LL$re74)+1000, by=1000),plot=FALSE)$breaks
re75cut <- hist(LL$re75, breaks=seq(0,max(LL$re75)+1000, by=1000),plot=FALSE)$breaks
agecut <- hist(LL$age, breaks=seq(15,55, length.out=14),plot=FALSE)$breaks
mycp <- list(re75=re75cut, re74=re74cut, age=agecut)
mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp)
mat
# cem match: user-chosen coarsening, k-to-k matching
mat <- cem(treatment="treated",data=LL, drop="re78",cutpoints=mycp,k2k=TRUE)
mat
# Mahalanobis matching: we use MatchIt
if(require(MatchIt)){
  mah <- matchit(treated~age+education+re74+re75+black+hispanic+nodegree+married+u74+u75,
                 distance="mahalanobis", data=LL)
  mah
  # imbalance
  imbalance(LL$treated, LL, drop=todrop, weights=mah$weights)
}
# Multiply imputed data
# making use of Amelia for multiple imputation
if(require(Amelia)){
  data(LL)
  n <- dim(LL)[1]
  k <- dim(LL)[2]
  set.seed(123)
  LL1 <- LL
  idx <- sample(1:n, .3*n)
  for(i in idx){
    LL1[i,sample(2:k,1)] <- NA
  }
  imputed <- amelia(LL1,noms=c("black","hispanic","treated","married",
                               "nodegree","u74","u75"))
  imputed <- imputed$imputations[1:5]
  # without information on which observation has missing values
  mat1 <- cem("treated", datalist=imputed, drop="re78")
  mat1
  # ATT estimation
  out <- att(mat1, re78 ~ treated, data=imputed)
  # with information about missingness
  mat2 <- cem("treated", datalist=imputed, drop="re78", data=LL1)
  mat2
  # ATT estimation
  out <- att(mat2, re78 ~ treated, data=imputed)
}