autoSquash {openEBGM} | R Documentation |
Automated data squashing
Description
autoSquash squashes data by calling squashData once for each count (N), removing the need to repeatedly squash the same data set.
Usage
autoSquash(
  data,
  keep_pts = c(100, 75, 50, 25),
  cut_offs = c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06),
  num_super_pts = c(50, 75, 150, 500, 750, 1000, 2000, 5000)
)
Arguments
data |
A data frame (typically from |
keep_pts |
A vector of whole numbers for the number of points to leave unsquashed for each count (N). See the 'Details' section. |
cut_offs |
A vector of whole numbers for the cutoff values of unsquashed data used to determine how many "super points" to end up with after squashing each count (N). See the 'Details' section. |
num_super_pts |
A vector of whole numbers for the number of "super points" to end up with after squashing each count (N). Length must be 1 more than length of |
Details
See squashData for details on squashing a given count (N).
The elements in keep_pts determine how many points are left unsquashed for each count (N). The first element in keep_pts is used for the smallest N (usually 1). Each successive element is used for each successive N. Once the last element is reached, it is used for all other N.
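The recycling rule for keep_pts can be illustrated with a small sketch. The helper keep_for_counts below is hypothetical and not part of openEBGM; it assumes that "each successive N" means each successive distinct count value present in the data, ordered from smallest to largest.

```r
# Hypothetical helper (not part of openEBGM): show which keep_pts element
# would apply to each distinct count, assuming counts are taken in
# increasing order and the last element is reused once keep_pts runs out.
keep_for_counts <- function(counts, keep_pts = c(100, 75, 50, 25)) {
  n_vals <- sort(unique(counts))
  # Cap the index at length(keep_pts) so the last element is reused
  idx <- pmin(seq_along(n_vals), length(keep_pts))
  data.frame(N = n_vals, keep = keep_pts[idx])
}

# With the default keep_pts, N = 1, 2, 3, 4 get 100, 75, 50, 25,
# and every larger N also gets 25.
keep_for_counts(c(1, 1, 2, 3, 4, 5, 6))
```

Under the same assumption, keep_pts = c(50, 5) would leave 50 points unsquashed for the smallest N and 5 for every other N.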
For counts that are squashed, cut_offs and num_super_pts determine how the points are squashed. For instance, by default, if a given N contains fewer than 500 points to be squashed, then those points are squashed to 50 "super points".
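The pairing of cut_offs with num_super_pts can be sketched as a bin lookup. The helper super_pts_for below is hypothetical and is not how openEBGM is necessarily implemented; it uses base R's findInterval and assumes that a number of points exactly equal to a cutoff falls into the higher bin, which the text above does not specify.

```r
# Default values copied from the Usage section
cut_offs      <- c(500, 1000, 10000, 1e+05, 5e+05, 1e+06, 5e+06)
num_super_pts <- c(50, 75, 150, 500, 750, 1000, 2000, 5000)

# Hypothetical helper (not part of openEBGM): look up how many super points
# a count would be squashed to, given the number of points to be squashed.
super_pts_for <- function(n_to_squash, cut_offs, num_super_pts) {
  # One more bin than cutoffs, matching the documented length requirement
  stopifnot(length(num_super_pts) == length(cut_offs) + 1)
  # findInterval() returns 0 below the first cutoff, so shift by 1
  num_super_pts[findInterval(n_to_squash, cut_offs) + 1]
}

# Fewer than 500 points -> 50 super points; beyond the last cutoff -> 5000
super_pts_for(c(499, 2500, 6e6), cut_offs, num_super_pts)
```

The k cutoffs split the range of possible point counts into k + 1 bins, which is why num_super_pts needs one more element than cut_offs.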
Value
A data frame with column names N, E, and weight containing the reduced data set.
References
DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.
See Also
processRaw for data preparation and squashData for squashing individual counts
Examples
data.table::setDTthreads(2) #only needed for CRAN checks
data(caers)
proc <- processRaw(caers)
table(proc$N)
squash1 <- autoSquash(proc)
ftable(squash1[, c("N", "weight")])
## Not run: squash2 <- autoSquash(proc, keep_pts = c(50, 5))
## Not run: ftable(squash2[, c("N", "weight")])
## Not run:
squash3 <- autoSquash(proc, keep_pts = 100,
                      cut_offs = c(250, 500),
                      num_super_pts = c(20, 60, 125))
## End(Not run)
## Not run: ftable(squash3[, c("N", "weight")])