cem {cem} | R Documentation |

Implementation of Coarsened Exact Matching

cem(treatment=NULL, data = NULL, datalist=NULL, cutpoints = NULL, grouping = NULL, drop=NULL, eval.imbalance = FALSE, k2k=FALSE, method=NULL, mpower=2, L1.breaks = NULL, L1.grouping = NULL, verbose = 0, baseline.group="1",keep.all=FALSE)

`treatment` |
character, name of the treatment variable |

`baseline.group` |
character, name of the baseline level treatment. See Details. |

`data` |
a data.frame |

`datalist` |
a list of optional multiply imputed data.frame's |

`cutpoints` |
named list each describing the cutpoints for numerical variables (the names are variable names). Each list element is either a vector of cutpoints, a number of cutpoints, or a method for automatic bin construction. See Details. |

`grouping` |
named list, each element of which is a list of groupings for a single categorical variable. See Details. |

`drop` |
a vector of variable names in the data frame to ignore during matching |

`eval.imbalance` |
Boolean. See Details. |

`k2k` |
boolean, restrict to k-to-k matching? Default = `FALSE` |

`method` |
distance method to use in `k2k` matching. See Details. |

`mpower` |
power of the Minkowski distance. See Details. |

`L1.breaks` |
list of cutpoints for the calculation of the L1 measure. |

`L1.grouping` |
as `grouping`, but used only in the calculation of the L1 measure. |

`verbose` |
controls level of verbosity. Default=0. |

`keep.all` |
if `FALSE` (the default), the coarsened dataset is not returned. |

For multilevel (and binary) treatment variables, the cem weights are calculated with respect to the `baseline.group`. Therefore, matched units with treatment variable equal to the baseline level receive weight 1, and the others receive the usual cem weights. Unless otherwise specified, `baseline.group` is set to `"1"`. If this level is not one of the possible values taken by the `treatment` variable, then the baseline is set to the first level of the `treatment` variable.
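As an illustration of the weighting described above, the following base R sketch computes weights for a binary treatment with baseline level 1. The formula used for the "usual cem weights", (m_C / m_T) * (m_T^s / m_C^s) for non-baseline units in stratum s, is an assumption taken from the CEM literature rather than from this page, and the data frame `d` with its stratum labels is hypothetical.

```r
# Hypothetical matched data: treatment indicator and stratum label per unit
d <- data.frame(
  treated = c(1, 1, 0, 1, 0, 0, 0),
  stratum = c("s1", "s1", "s1", "s2", "s2", "s2", "s2")
)

# Total matched units in the baseline group (treated == 1) and in the other group
m_T <- sum(d$treated == 1)
m_C <- sum(d$treated == 0)

# Per-stratum counts, repeated for every unit in the stratum
mT_s <- ave(as.numeric(d$treated == 1), d$stratum, FUN = sum)
mC_s <- ave(as.numeric(d$treated == 0), d$stratum, FUN = sum)

# Baseline units get weight 1; the others get the assumed CEM weight
d$w <- ifelse(d$treated == 1, 1, (m_C / m_T) * (mT_s / mC_s))
d
```

Under this assumed formula the weights of the non-baseline units sum to the number of matched non-baseline units.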

When specifying cutpoints, several automatic methods may be chosen, including “`sturges`” (Sturges' rule, the default), “`fd`” (Freedman-Diaconis' rule), “`scott`” (Scott's rule) and “`ss`” (Shimazaki-Shinomoto's rule). See the references for a description of each rule.
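To give a feel for how three of these rules differ, the sketch below uses the bin-count helpers shipped with base R (`grDevices::nclass.Sturges`, `nclass.FD`, `nclass.scott`) on simulated data and then coarsens the variable with `cut`. This only illustrates the rules themselves; whether `cem` computes its bins in exactly this way is not stated on this page, and the variable `x` is made up.

```r
set.seed(1)
x <- rnorm(100)  # hypothetical numerical covariate

# Number of bins suggested by each rule
bins <- c(
  sturges = grDevices::nclass.Sturges(x),
  fd      = grDevices::nclass.FD(x),
  scott   = grDevices::nclass.scott(x)
)
bins

# Coarsen x into equal-width intervals using the Sturges bin count
coarsened <- cut(x, breaks = bins[["sturges"]])
table(coarsened)
```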

The `grouping` option is a list where each element is itself a list. For example, suppose that variable `quest1` has the possible levels `"no answer", NA, "negative", "neutral", "positive"` and you want to collect `("no answer", NA, "neutral")` into a single group; then the `grouping` argument should contain `list(quest1=list(c("no answer", NA, "neutral")))`. Or, if you have a discrete variable `elements` with values `1:10` and you want to collect it into the groups “`1:3,NA`”, “`4`”, “`5:9`”, “`10`”, you specify in `grouping` the list `list(elements=list(c(1:3,NA), 5:9))`. Values not defined in `grouping` are left as they are. If `cutpoints` and `grouping` are both defined for the same variable, the `grouping` takes precedence and the corresponding cutpoints are set to `NULL`.
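The effect of a `grouping` entry can be pictured with a small base R sketch. The helper `collapse_levels` and the way it labels a merged group are hypothetical and are not part of the package; the sketch only shows which values end up together and that unlisted values are left unchanged.

```r
quest1 <- c("no answer", NA, "negative", "neutral", "positive", "neutral")
grouping <- list(quest1 = list(c("no answer", NA, "neutral")))

# Hypothetical helper: replace every value in a group by one shared label
collapse_levels <- function(x, groups) {
  out <- as.character(x)
  for (g in groups) {
    label <- paste(g, collapse = "|")  # illustrative label only
    out[x %in% g] <- label             # %in% also matches NA against NA
  }
  out
}

collapsed <- collapse_levels(quest1, grouping$quest1)
table(collapsed)
```

Here "negative" and "positive" keep their own values, while the three listed values (including `NA`) share one category.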

`verbose`: a number greater than or equal to 0. The higher the value, the more information is provided during the execution of the algorithm.

If `eval.imbalance` = `TRUE`, `cem$imbalance` contains the imbalance measure by absolute difference in means for numerical variables and chi-square distance for categorical variables. If `FALSE` (the default), then `cem$imbalance` is set to `NULL`. If the data contain missing values, the imbalance measures are not calculated.
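The two kinds of imbalance measure can be sketched in base R as follows. The data frame is made up, and using the Pearson statistic from `chisq.test` as the "chi-square distance" is an assumption; the exact statistic computed by the package is not specified on this page.

```r
# Hypothetical data: one numerical and one categorical covariate
d <- data.frame(
  treated = c(1, 1, 1, 0, 0, 0),
  age     = c(20, 30, 40, 25, 35, 60),
  married = c(1, 1, 0, 0, 0, 1)
)

# Numerical variable: absolute difference in group means
diff_age <- abs(mean(d$age[d$treated == 1]) - mean(d$age[d$treated == 0]))

# Categorical variable: Pearson chi-square statistic (assumed measure);
# warnings about small expected counts are suppressed for this toy table
tab  <- table(d$married, d$treated)
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)

c(age = diff_age, married = unname(chi2))
```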

If `L1.breaks` is missing, cutpoints are calculated by Scott's rule by default.

If `k2k` is set to `TRUE`, the algorithm returns strata with the same number of treated and control units per stratum; otherwise all the matched units are returned (the default). When `k2k` = `TRUE`, the user can choose a `method` (one of '`euclidean`', '`maximum`', '`manhattan`', '`canberra`', '`binary`' and '`minkowski`') for nearest neighbor matching inside each `cem` stratum. By default `method` is set to `NULL`, which means random matching inside `cem` strata. For the Minkowski distance, the power can be specified via the argument `mpower`. For more information on `method != NULL`, refer to the `dist` help page. If `k2k` is set to `TRUE`, `keep.all` is also set to `TRUE`.
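The idea of distance-based k-to-k matching inside one stratum can be sketched with base R's `dist`. The stratum, the unit names and the greedy pairing loop below are assumptions made for illustration; the package's own pairing procedure may differ.

```r
# One hypothetical stratum: two treated units (t1, t2), three controls (c1-c3)
X <- rbind(t1 = c(1, 2), t2 = c(4, 0),
           c1 = c(1, 3), c2 = c(5, 1), c3 = c(9, 9))
treated <- c("t1", "t2")
control <- c("c1", "c2", "c3")

# Minkowski distance; p plays the role of mpower (p = 2 is Euclidean)
d    <- as.matrix(dist(X, method = "minkowski", p = 2))
d_tc <- d[treated, control]

# Greedy nearest-neighbor pairing without replacement (illustrative only)
matched   <- character(0)
available <- control
for (tr in treated) {
  best        <- available[which.min(d_tc[tr, available])]
  matched[tr] <- best
  available   <- setdiff(available, best)
}
matched
```

After pairing, the stratum keeps as many controls as treated units and the unpaired control (`c3` here) is discarded.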

By default, `cem` treats missing values as distinct categories and matches observations with missing values in the same variable in the same stratum, provided that all the remaining (coarsened) covariates match.

If argument `data` is non-`NULL` and `datalist` is `NULL`, CEM is applied to the single data set in `data`.

Argument `datalist` is a list of (multiply imputed) data frames (i.e., with missing cell values imputed). If `data` is `NULL`, the function `cem` is applied independently to each element of the list, resulting in separately matched data sets with different numbers of treated and control units.

When `data` and `datalist` are both non-`NULL`, each multiply imputed observation is assigned to the stratum in which it has been matched most frequently. In this case, the algorithm outputs the same matching solution for each multiply imputed data set (i.e., an observation, and the number of treated and control units matched, in one data set has the same meaning in all, and is the same for all).
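The "most frequently matched stratum" rule can be pictured with a small base R sketch. The matrix of stratum labels is hypothetical, it assumes that labels are comparable across imputed data sets, and the tie-breaking behaviour (first label in sorted order) is a side effect of `which.max`, not something stated on this page.

```r
# Rows: observations; columns: imputed data sets; entries: stratum labels
strata <- cbind(imp1 = c(1, 2, 3),
                imp2 = c(1, 2, 2),
                imp3 = c(1, 3, 2))

# For each observation, pick the stratum label that occurs most often
modal <- apply(strata, 1, function(s) as.numeric(names(which.max(table(s)))))
modal
```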

Returns an object of class `cem.match` if only `data` is not `NULL`, or an object of class `cem.match.list`, which is a list of objects of class `cem.match` plus a field called `unique`, which is true only if `data` and `datalist` are not both `NULL`. A `cem.match` object is a list with the following slots:

`call` |
the call |

`strata` |
vector of the stratum number to which each observation belongs, NA if the observation has not been matched |

`n.strata` |
number of strata generated |

`vars` |
names of the variables used for the match |

`drop` |
variables removed from the match |

`X` |
the coarsened dataset or NULL if `keep.all = FALSE` |

`breaks` |
named list of cutpoints, possibly NULL |

`treatment` |
name of the treatment variable |

`groups` |
factor, each observation belongs to one group generated by the treatment variable |

`n.groups` |
number of groups identified by the treatment variable |

`group.idx` |
named list, index of observations belonging to each group |

`group.len` |
sizes of groups |

`tab` |
summary table of matched units by group |

`imbalance` |
NULL or a vector of imbalances. See Details. |

Stefano Iacus, Gary King, and Giuseppe Porro

Iacus, King, Porro (2011) doi: 10.1198/jasa.2011.tm09599

Iacus, King, Porro (2012) doi: 10.1093/pan/mpr013

Iacus, King, Porro (2019) doi: 10.1017/pan.2018.29

Shimazaki, Shinomoto (2007) doi: 10.1162/neco.2007.19.6.1503

```r
library(cem)

data(LL)

todrop <- c("treated", "re78")

imbalance(LL$treated, LL, drop = todrop)

# cem match: automatic bin choice
mat <- cem(treatment = "treated", data = LL, drop = "re78")
mat

# cem match: user-chosen coarsening
re74cut <- hist(LL$re74, breaks = seq(0, max(LL$re74) + 1000, by = 1000), plot = FALSE)$breaks
re75cut <- hist(LL$re75, breaks = seq(0, max(LL$re75) + 1000, by = 1000), plot = FALSE)$breaks
agecut <- hist(LL$age, breaks = seq(15, 55, length.out = 14), plot = FALSE)$breaks
mycp <- list(re75 = re75cut, re74 = re74cut, age = agecut)
mat <- cem(treatment = "treated", data = LL, drop = "re78", cutpoints = mycp)
mat

# cem match: user-chosen coarsening, k-to-k matching
mat <- cem(treatment = "treated", data = LL, drop = "re78", cutpoints = mycp, k2k = TRUE)
mat

# Mahalanobis matching: we use MatchIt
if (require(MatchIt)) {
  mah <- matchit(treated ~ age + education + re74 + re75 + black + hispanic +
                   nodegree + married + u74 + u75,
                 distance = "mahalanobis", data = LL)
  mah

  # imbalance
  imbalance(LL$treated, LL, drop = todrop, weights = mah$weights)
}

# Multiply imputed data
# making use of Amelia for multiple imputation
if (require(Amelia)) {
  data(LL)
  n <- dim(LL)[1]
  k <- dim(LL)[2]

  # set about 30% of the rows to have one missing cell each
  set.seed(123)
  LL1 <- LL
  idx <- sample(1:n, .3 * n)
  for (i in idx) {
    LL1[i, sample(2:k, 1)] <- NA
  }

  imputed <- amelia(LL1, noms = c("black", "hispanic", "treated", "married",
                                  "nodegree", "u74", "u75"))
  imputed <- imputed$imputations[1:5]

  # without information on which observation has missing values
  mat1 <- cem("treated", datalist = imputed, drop = "re78")
  mat1

  # ATT estimation
  out <- att(mat1, re78 ~ treated, data = imputed)

  # with information about missingness
  mat2 <- cem("treated", datalist = imputed, drop = "re78", data = LL1)
  mat2

  # ATT estimation
  out <- att(mat2, re78 ~ treated, data = imputed)
}
```

[Package *cem* version 1.1.29 Index]