quickblock {quickblock} | R Documentation |
Construct threshold blockings
Description
quickblock constructs near-optimal threshold blockings. The function
expects the user to provide distances measuring the similarity of
units and a required minimum block size. It then constructs a blocking
so that units assigned to the same block are as similar as possible while
satisfying the minimum block size.
Usage
quickblock(
distances,
size_constraint = 2L,
caliper = NULL,
break_large_blocks = FALSE,
...
)
Arguments
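distances |
distances object or covariate data (for example, a data frame) describing the similarity of the units to be blocked. |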
size_constraint |
integer with the required minimum number of units in each block. |
caliper |
restrict the maximum within-block distance. |
break_large_blocks |
logical indicating whether large blocks should be broken up into smaller blocks. |
... |
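additional parameters to be sent either to the distances function (when the distances parameter contains covariate data) or to the underlying sc_clustering function. |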
Details
The caliper parameter constrains the maximum distance between units
assigned to the same block. This is implemented by restricting the
edge weight in the graph used to construct the blocks (see sc_clustering
for details). As a result, the caliper will affect all blocks and, in
general, make it harder for the function to find good matches even for
blocks where the caliper is not binding. In particular, an overly tight
caliper can lead to discarded units that otherwise would be assigned to a
block satisfying both the matching constraints and the caliper. For this
reason, it is recommended to set the caliper value quite high and only use
it to avoid particularly poor blocks. It is strongly recommended to use the
caliper parameter only when primary_unassigned_method = "closest_seed" in
the underlying sc_clustering function (which is the default behavior).
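The effect of the caliper can be checked by counting how many units are left without a block. The following is a minimal sketch, not part of the original examples. It assumes that the returned qb_blocking object stores one integer block label per unit and that discarded units are recorded as NA; the caliper values 0.5 and 0.1 are arbitrary illustrations.

```r
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

# A loose caliper mainly rules out very poor blocks; a tight one may discard units.
loose <- quickblock(my_distances, caliper = 0.5)
tight <- quickblock(my_distances, caliper = 0.1)

# Assumption: unassigned units carry an NA label in the blocking object.
n_unassigned <- c(loose = sum(is.na(as.integer(loose))),
                  tight = sum(is.na(as.integer(tight))))
print(n_unassigned)
```

If the tight caliper discards many units, that is a sign the value is too low for the data at hand.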
The main algorithm used to construct the blocking may produce
some blocks that are much larger than the minimum size constraint. If
break_large_blocks is TRUE, all blocks at least twice as large as
size_constraint will be broken into two or more smaller blocks. Blocks
are broken so as to ensure that the new blocks satisfy the size constraint.
In general, large blocks are produced when units are highly clustered,
so breaking up large blocks will often only lead to small improvements. The
blocks are broken using the hierarchical_clustering function.
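To see what breaking large blocks does in practice, the block sizes with and without the option can be compared. This is an illustrative sketch only. It assumes the qb_blocking object can be converted to a plain vector of block labels with as.integer; the variable names are hypothetical.

```r
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

k <- 3L
unbroken <- quickblock(my_distances, size_constraint = k)
broken <- quickblock(my_distances, size_constraint = k,
                     break_large_blocks = TRUE)

# Number of units in each block (assumes one integer label per unit).
size_unbroken <- table(as.integer(unbroken))
size_broken <- table(as.integer(broken))

# Smallest and largest block under each setting.
print(range(size_unbroken))
print(range(size_broken))
```

Both blockings should respect the minimum size k; the difference, if any, shows up in the largest block sizes.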
quickblock calls sc_clustering with seed_method = "inwards_updating".
The seed_method parameter governs how the seeds are selected in the
nearest neighborhood graph that is used to construct the blocks (see
sc_clustering for details). The "inwards_updating" option generally works
well and is safe with most datasets. Using seed_method = "exclusion_updating"
often leads to better performance (in the sense of blocks with more
similar units), but it may increase run time. Discrete data (or more generally
when units tend to be at equal distance to many other units) will lead to
particularly poor run time with this option. If the dataset has at least one
continuous covariate, "exclusion_updating" is typically quick. A third
option is seed_method = "lexical", which decreases the run time relative
to "inwards_updating" (sometimes considerably) at the cost of performance.
quickblock passes parameters on to sc_clustering, so to change seed_method,
call quickblock with the parameter specified as usual:
quickblock(..., seed_method = "exclusion_updating").
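The three seed methods can be tried side by side on the same distances. The sketch below is not from the package documentation. It only counts the number of blocks each method produces, as a rough indicator, and assumes block labels can be extracted with as.integer.

```r
library(quickblock)

set.seed(1)
my_data <- data.frame(x1 = runif(100), x2 = runif(100))
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))

methods <- c("lexical", "inwards_updating", "exclusion_updating")

# seed_method is passed through `...` to sc_clustering.
blockings <- lapply(methods, function(m) {
  quickblock(my_distances, seed_method = m)
})
names(blockings) <- methods

# Count distinct blocks per method (assumes one integer label per unit).
n_blocks <- vapply(blockings,
                   function(b) length(unique(as.integer(b))),
                   integer(1))
print(n_blocks)
```

With a fixed minimum block size, more blocks means smaller blocks on average, but it does not by itself measure how similar the units within each block are.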
Value
Returns a qb_blocking object with the constructed blocks.
References
Higgins, Michael J., Fredrik Sävje and Jasjeet S. Sekhon (2016), ‘Improving massive experiments with threshold blocking’, Proceedings of the National Academy of Sciences, 113:27, 7369–7376.
See Also
See sc_clustering for the underlying function used to construct the blocks.
Examples
# Construct example data
my_data <- data.frame(x1 = runif(100),
x2 = runif(100))
# Make distances
my_distances <- distances(my_data, dist_variables = c("x1", "x2"))
# Make blocking with at least two units in each block
quickblock(my_distances)
# Require at least three units in each block
quickblock(my_distances, size_constraint = 3)
# Impose caliper
quickblock(my_distances, caliper = 0.2)
# Break large blocks
quickblock(my_distances, break_large_blocks = TRUE)
# Call `quickblock` directly with covariate data (i.e., not pre-calculating distances)
quickblock(my_data[c("x1", "x2")])
# Call `quickblock` directly with covariate data using Mahalanobis distances
quickblock(my_data[c("x1", "x2")], normalize = "mahalanobize")