squashData {openEBGM} | R Documentation |
Squash data for hyperparameter estimation
Description
squashData squashes data by binning expected counts, E, for a given actual count, N, using bin means as the expected counts for the reduced data set. The squashed points are weighted by bin size. Data can be squashed to reduce computational burden (see DuMouchel and Pregibon, 2001) when estimating the hyperparameters.
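The binning described above can be sketched in base R. This is only an illustration of the idea, not the openEBGM implementation: the helper name squash_sketch, the rule that leftover points join the last bin, and the choice to treat only points with weight 1 as candidates are all assumptions made for this example.

```r
# Illustrative sketch only; squash_sketch() is a hypothetical helper,
# not part of openEBGM, and its edge-case handling is assumed.
squash_sketch <- function(data, count = 1, bin_size = 50, keep_pts = 100) {
  if (!"weight" %in% names(data)) data$weight <- 1L

  # Candidates: unsquashed points (weight 1) with the requested count
  is_target <- data$N == count & data$weight == 1
  target <- data[is_target, c("N", "E", "weight")]
  rest   <- data[!is_target, c("N", "E", "weight")]
  target <- target[order(target$E), ]

  n_squash <- nrow(target) - keep_pts
  if (n_squash < bin_size) return(rbind(rest, target))  # too few points to bin

  to_squash <- target[seq_len(n_squash), ]
  kept      <- target[-seq_len(n_squash), ]  # largest keep_pts Es stay as they are

  # Consecutive bins over the sorted Es; leftover points join the last bin
  n_bins <- n_squash %/% bin_size
  bin <- pmin((seq_len(n_squash) - 1) %/% bin_size + 1, n_bins)

  squashed <- data.frame(
    N = as.integer(count),
    E = as.numeric(tapply(to_squash$E, bin, mean)),  # bin mean replaces the points
    weight = as.integer(table(bin))                  # bin size becomes the weight
  )
  out <- rbind(rest, kept, squashed)
  rownames(out) <- NULL
  out
}

# Toy input: 1000 points with N = 1 and right-skewed expected counts
set.seed(1)
toy <- data.frame(N = 1L, E = rexp(1000))
sq <- squash_sketch(toy, count = 1, bin_size = 50, keep_pts = 100)
nrow(sq)        # 18 bins plus 100 unsquashed points
sum(sq$weight)  # total weight equals the original number of points
```

Because each bin mean is weighted by its bin size, the weighted sum of E in the reduced data matches the sum of E in the original data (up to floating-point error).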
Usage
squashData(
data,
count = 1,
bin_size = 50,
keep_pts = 100,
min_bin = 50,
min_pts = 500
)
Arguments
data |
A data frame (typically from |
count |
A non-negative scalar whole number for the count size, N, used for binning |
bin_size |
A scalar whole number (>= 2) |
keep_pts |
A non-negative scalar whole number for the number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”. |
min_bin |
A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”. |
min_pts |
A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”. |
Details
Can be applied iteratively, once per count value (count = 1, then 2, etc.), passing each result to the next call.
The N column in data will be coerced using as.integer, and E will be coerced using as.numeric. Missing data are not allowed.
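A minimal base-R sketch of checking input before squashing, mirroring the coercion described above. The data frame raw is hypothetical, and this is a pre-check a user might run, not the package's internal code.

```r
# Hypothetical raw input where both columns were read in as character
raw <- data.frame(N = c("0", "1", "2"), E = c("0.5", "1.25", "3"),
                  stringsAsFactors = FALSE)

# Apply the same coercions that the documentation describes
prepped <- transform(raw, N = as.integer(N), E = as.numeric(E))
str(prepped)

# Missing values are not allowed, so fail early if coercion produced any
stopifnot(!anyNA(prepped$N), !anyNA(prepped$E))

# Note that as.integer() truncates toward zero rather than rounding
as.integer(1.9)  # 1
```

Checking for non-whole counts before calling the function avoids silent truncation of N.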
Since the distribution of expected counts, E, tends to be skewed to the right, the largest Es are not squashed by default. This behavior can be changed by setting the keep_pts argument to zero (0); however, this is not recommended. Squashing the largest Es could result in a large loss of information, so it is recommended to use a value of 100 or more for keep_pts.
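The following base-R sketch illustrates why the largest values are the costliest to squash. It uses simulated log-normal values as a stand-in for right-skewed expected counts; the distribution and its parameters are assumptions chosen only for illustration.

```r
# Simulated right-skewed values standing in for expected counts (assumption)
set.seed(42)
E <- sort(rlnorm(1000, meanlog = 0, sdlog = 1.5))

# Spread of values inside the lowest and the highest bin of 50 points
low_spread  <- diff(range(head(E, 50)))
high_spread <- diff(range(tail(E, 50)))

# Replacing a bin with its mean discards this within-bin spread,
# which is far larger in the right tail than near zero
c(lowest_bin = low_spread, highest_bin = high_spread)
```

Leaving the largest points unsquashed through keep_pts preserves that tail spread.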
Values for keep_pts, min_bin, and min_pts should typically be at least as large as the default values.
Value
A data frame with column names N, E, and weight containing the reduced data set.
References
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
See Also
processRaw for data preparation and autoSquash for automatically squashing an entire data set in one function call.
Examples
# Toy data: 26 points with counts 0 to 3 and simulated expected counts
set.seed(483726)
dat <- data.frame(
  var1 = letters[1:26], var2 = LETTERS[1:26],
  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2), rnorm(1, 3))), 3),
  stringsAsFactors = FALSE
)

# Squash iteratively, one count at a time, passing each result to the next call
(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

# Vary keep_pts for the N = 1 points (count defaults to 1)
squashData(zeroes, bin_size = 2, keep_pts = 0,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3,
           min_bin = 2, min_pts = 2)