squashData {openEBGM}R Documentation

Squash data for hyperparameter estimation

Description

squashData squashes data by binning expected counts, E, for a given actual count, N, using bin means as the expected counts for the reduced data set. The squashed points are weighted by bin size. Data can be squashed to reduce computational burden (see DuMouchel et al., 2001) when estimating the hyperparameters.

Usage

squashData(
  data,
  count = 1,
  bin_size = 50,
  keep_pts = 100,
  min_bin = 50,
  min_pts = 500
)

Arguments

data

A data frame (typically from processRaw or a previous call to squashData) containing columns named N, E, and (possibly) weight. Can contain additional columns, which will be ignored.

count

A non-negative scalar whole number for the count size, N, used for binning

bin_size

A scalar whole number (>= 2)

keep_pts

A nonnegative scalar whole number for number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”.

min_bin

A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”.

min_pts

A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”.

Details

Can be used iteratively (count = 1, then 2, etc.).

The N column in data will be coerced using as.integer, and E will be coerced using as.numeric. Missing data are not allowed.

Since the distribution of expected counts, E, tends to be skewed to the right, the largest Es are not squashed by default. This behavior can be changed by setting the keep_pts argument to zero (0); however, this is not recommended. Squashing the largest Es could result in a large loss of information, so it is recommended to use a value of 100 or more for keep_pts.

Values for keep_pts, min_bin, and min_pts should typically be at least as large as the default values.

Value

A data frame with column names N, E, and weight containing the reduced data set.

References

DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.

See Also

processRaw for data preparation and autoSquash for automatically squashing an entire data set in one function call

Examples

set.seed(483726)
dat <- data.frame(
  var1 = letters[1:26], var2 = LETTERS[1:26],
  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2), rnorm(1, 3))), 3),
  stringsAsFactors = FALSE
)
(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

squashData(zeroes, bin_size = 2, keep_pts = 0,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3,
           min_bin = 2, min_pts = 2)


[Package openEBGM version 0.9.1 Index]