quickmatch {quickmatch} | R Documentation |
Derive generalized full matchings
Description
quickmatch constructs near-optimal generalized full matchings. The
function expects the user to provide distances measuring the similarity of
units and a set of matching constraints. It then constructs a matching
so that units assigned to the same group are as similar as possible while
satisfying the matching constraints.
Usage
quickmatch(
  distances,
  treatments,
  treatment_constraints = NULL,
  size_constraint = NULL,
  target = NULL,
  caliper = NULL,
  ...
)
Arguments
distances |
|
treatments |
factor specifying the units' treatment assignments. |
treatment_constraints |
named integer vector with the treatment constraints. If |
size_constraint |
integer with the required total number of units in each group. Must be
greater than or equal to the sum of |
target |
units to target the matching for. All units indicated by |
caliper |
restrict the maximum within-group distance. |
... |
additional parameters to be sent either to the |
Details
The treatment_constraints parameter should be a named vector with
treatment-specific constraints. For example, in a sample with treatment
conditions "A", "B" and "C", the vector c("A" = 1, "B" = 2, "C" = 0)
specifies that each matched group should contain at least one unit with
treatment "A", at least two units with treatment "B" and any number of units
with treatment "C". Treatments not specified in the vector default to zero.
For example, the vector c("A" = 1, "B" = 2) is identical to the previous
one. When treatment_constraints is NULL, the function requires at least one
unit for each treatment in each group. In our current example, NULL would be
shorthand for c("A" = 1, "B" = 1, "C" = 1).
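As a small sketch of the default rule above, the following compares the two equivalent ways of writing the constraints. The data, the seed and the object names (ex_data, covs, m_full, m_short) are made up for illustration, and covariates are passed directly so that quickmatch computes the distances itself. Since both vectors describe the same constraints, one would expect the two calls to produce the same matching.

```r
library(quickmatch)

set.seed(20170301)  # arbitrary seed, only to make the illustration reproducible

# Hypothetical sample with three treatment conditions
ex_data <- data.frame(
  x1 = runif(90),
  x2 = runif(90),
  treatment = factor(sample(rep(c("A", "B", "C"), each = 30)))
)
covs <- ex_data[c("x1", "x2")]

# "C" spelled out as unconstrained ...
m_full <- quickmatch(covs, ex_data$treatment,
                     treatment_constraints = c("A" = 1, "B" = 2, "C" = 0))

# ... or left out, in which case it defaults to zero
m_short <- quickmatch(covs, ex_data$treatment,
                      treatment_constraints = c("A" = 1, "B" = 2))

# Compare the group labels produced by the two calls
same_groups <- identical(as.integer(m_full), as.integer(m_short))

# Number of units per treatment condition within each matched group
group_counts <- table(as.integer(m_full), ex_data$treatment)
```

Each row of group_counts is a matched group, so the "A" column should be at least one and the "B" column at least two in every row.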
The size_constraint parameter can be used to constrain the matched
groups to contain at least a certain number of units in total (independently
of treatment assignment). For example, if treatment_constraints =
c("A" = 1, "B" = 2) and size_constraint = 4, each matched
group will contain at least one unit assigned to "A", at least two units
assigned to "B" and at least four units in total, where the fourth unit can
be from any treatment condition.
The target parameter can be used to control which units are included
in the matching. When target is NULL (the default), all units
will be assigned to a matched group. When not NULL, the parameter
indicates that some units must be assigned to a matched group and that the
remaining units can safely be ignored. This can be useful, for example,
when one is interested in estimating treatment effects only for a certain
type of units (e.g., the average treatment effect for the treated, ATT). It
is particularly useful when units of interest are not represented in the
whole covariate space (i.e., a one-sided overlap problem). Without the
target parameter, the function would in such cases try to assign every
unit to a group, including units in sparse regions that we are not interested
in. This could lead to unnecessarily large and diverse matched groups. By
specifying that some units are of interest only insofar as they help us satisfy
the matching constraints (i.e., setting the target parameter to the
appropriate value), we can avoid such situations.
Consider, as an example, a study with two treatment conditions, "A" and "B".
Units assigned to "B" are more numerous and tend to have more extreme
covariate values. We are, however, only interested in estimating the
treatment effect for units assigned to "A". By specifying target = "A",
the function ensures that all "A" units are assigned to matched groups. Some
units assigned to treatment "B" – in particular the units with extreme
covariate values – will be left unassigned. However, as those units are not
of interest, they can safely be ignored, and we avoid groups of poor quality.
Even if some of the units that can be ignored are not needed to satisfy the
matching constraints, it is rarely beneficial to discard them blindly; they can
occasionally provide useful information. The default behavior when target
is non-NULL is to assign as many of the ignorable units as possible, provided
that the within-group distances do not increase too much
(using secondary_unassigned_method = "estimated_radius"). This behavior
might, however, reduce covariate balance in some instances. If called with
secondary_unassigned_method = "ignore", units not specified in
target will be discarded unless they are absolutely needed to satisfy
the matching constraints. This tends to reduce bias since the within-group
distances are minimized, but it could increase variance since we ignore
potentially useful information in the sample. An intermediate alternative
is to specify an aggressive caliper for the ignorable units, which is done
with the secondary_radius parameter. (These parameters are part of the
sc_clustering function that quickmatch calls. The target parameter
corresponds to the primary_data_points parameter in that function.)
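The difference between the two ways of handling ignorable units can be sketched as follows. The data and object names are hypothetical: twenty "A" units sit in a compact region and eighty "B" units are spread more widely, mimicking the one-sided overlap situation described above. The count of unassigned units relies on the assumption that unassigned units are stored as NA in the returned object.

```r
library(quickmatch)

set.seed(42)  # arbitrary seed, only to make the illustration reproducible

# Hypothetical sample: 20 "A" units in a compact region and 80 "B" units
# spread over a wider region of the covariate space
ex_data <- data.frame(
  x1 = c(runif(20), runif(80, min = -2, max = 3)),
  x2 = c(runif(20), runif(80, min = -2, max = 3)),
  treatment = factor(rep(c("A", "B"), c(20, 80)))
)
covs <- ex_data[c("x1", "x2")]

# Default behavior: ignorable "B" units are assigned as long as the
# within-group distances do not increase too much
m_default <- quickmatch(covs, ex_data$treatment, target = "A")

# Discard "B" units unless they are needed to satisfy the constraints
m_ignore <- quickmatch(covs, ex_data$treatment, target = "A",
                       secondary_unassigned_method = "ignore")

# Assumption: unassigned units appear as NA among the group labels
n_unassigned <- c(default = sum(is.na(as.integer(m_default))),
                  ignore = sum(is.na(as.integer(m_ignore))))
```

Comparing the two entries of n_unassigned gives a rough sense of how many "B" units each setting leaves out; in both cases every "A" unit should end up in a matched group.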
The caliper parameter constrains the maximum distance between units
assigned to the same matched group. This is implemented by restricting the
edge weight in the graph used to construct the matched groups (see
sc_clustering for details). As a result, the caliper
will affect all groups in the matching and, in general, make it harder for
the function to find good matches even for groups where the caliper is not
binding. In particular, an overly tight caliper can lead to discarded
units that otherwise would be assigned to a group satisfying both the
matching constraints and the caliper. For this reason, it is recommended
to set the caliper value quite high and only use it to avoid particularly
poor matches. It is strongly recommended to use the caliper parameter only
when primary_unassigned_method = "closest_seed" in the underlying
sc_clustering function (which is the default behavior).
quickmatch calls sc_clustering with seed_method = "inwards_updating".
The seed_method parameter
governs how the seeds are selected in the nearest neighborhood graph that
is used to construct the matched groups (see sc_clustering
for details). The "inwards_updating" option generally works well
and is safe with most datasets. Using seed_method = "exclusion_updating"
often leads to better performance (in the sense of matched groups with more
similar units), but it may increase run time. Discrete data (or more generally
when units tend to be at equal distance to many other units) will lead to
particularly poor run time with this option. If the data set has at least one
continuous covariate, "exclusion_updating" is typically reasonably
quick. A third option is seed_method = "lexical", which decreases the
run time relative to "inwards_updating" (sometimes considerably) at
the cost of performance. quickmatch passes parameters on to
sc_clustering, so to change seed_method, call quickmatch
with the parameter specified as usual:
quickmatch(..., seed_method = "exclusion_updating").
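A minimal sketch of switching between the three seed methods named above, using made-up data and object names. It only illustrates how the parameter is passed through; it makes no claim about which option gives the best matching or run time for a given data set.

```r
library(quickmatch)

set.seed(7)  # arbitrary seed, only to make the illustration reproducible

# Hypothetical sample with two continuous covariates
ex_data <- data.frame(
  x1 = runif(60),
  x2 = runif(60),
  treatment = factor(sample(rep(c("T1", "C"), c(20, 40))))
)
covs <- ex_data[c("x1", "x2")]

# Default seed method ("inwards_updating")
m_inwards <- quickmatch(covs, ex_data$treatment)

# Alternatives, passed on to sc_clustering through `...`
m_exclusion <- quickmatch(covs, ex_data$treatment,
                          seed_method = "exclusion_updating")
m_lexical <- quickmatch(covs, ex_data$treatment,
                        seed_method = "lexical")

# Number of distinct group labels produced by each method
n_groups <- sapply(
  list(inwards = m_inwards, exclusion = m_exclusion, lexical = m_lexical),
  function(m) length(unique(as.integer(m)))
)
```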
Value
Returns a qm_matching object with the matched groups.
References
Sävje, Fredrik, Michael J. Higgins and Jasjeet S. Sekhon (2017), ‘Generalized Full Matching’, arXiv 1703.03882. https://arxiv.org/abs/1703.03882
See Also
See sc_clustering for the underlying function used
to construct the matched groups.
Examples
# Construct example data
my_data <- data.frame(y = rnorm(100),
                      x1 = runif(100),
                      x2 = runif(100),
                      treatment = factor(sample(rep(c("T1", "T2", "C"), c(25, 25, 50)))))
# Make distances
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))
# Make matching with at least one unit from each of "T1", "T2" and "C" in each matched group
quickmatch(my_distances, my_data$treatment)
# Require at least two "C" in each group
quickmatch(my_distances,
           my_data$treatment,
           treatment_constraints = c("T1" = 1, "T2" = 1, "C" = 2))
# Require groups with at least six units in total
quickmatch(my_distances,
           my_data$treatment,
           treatment_constraints = c("T1" = 1, "T2" = 1, "C" = 2),
           size_constraint = 6)
# Focus the matching on units assigned to "T1" and "T2" (i.e., all
# units assigned to "T1" or "T2" will be assigned to a matched group).
# Units assigned to treatment "C" will be assigned to groups so as to
# ensure that each group contains at least one unit of each treatment
# condition. Remaining "C" units could be left unassigned.
quickmatch(my_distances,
           my_data$treatment,
           target = c("T1", "T2"))
# Impose caliper
quickmatch(my_distances,
           my_data$treatment,
           caliper = 0.25)
# Call `quickmatch` directly with covariate data (i.e., not pre-calculating distances)
quickmatch(my_data[c("x1", "x2")], my_data$treatment)
# Call `quickmatch` directly with covariate data using Mahalanobis distances
quickmatch(my_data[c("x1", "x2")],
           my_data$treatment,
           normalize = "mahalanobize")