R: Squash data for hyperparameter estimation

squashData {openEBGM}

R Documentation

Squash data for hyperparameter estimation

Description

squashData squashes data by binning expected counts, E, for a given actual count, N, using bin means as the expected counts for the reduced data set. The squashed points are weighted by bin size. Data can be squashed to reduce computational burden (see DuMouchel et al., 2001) when estimating the hyperparameters.

Usage

squashData(
  data,
  count = 1,
  bin_size = 50,
  keep_pts = 100,
  min_bin = 50,
  min_pts = 500
)

Arguments

`data`	A data frame (typically from `processRaw` or a previous call to `squashData`) containing columns named N, E, and (possibly) weight. Can contain additional columns, which will be ignored.
`count`	A non-negative scalar whole number for the count size, N, used for binning
`bin_size`	A scalar whole number (>= 2)
`keep_pts`	A nonnegative scalar whole number for number of points with the largest expected counts to leave unsquashed. Used to help prevent “oversquashing”.
`min_bin`	A positive scalar whole number for the minimum number of bins needed. Used to help prevent “oversquashing”.
`min_pts`	A positive scalar whole number for the minimum number of original (unsquashed) points needed for squashing. Used to help prevent “oversquashing”.

Details

Can be used iteratively (count = 1, then 2, etc.).

The N column in data will be coerced using as.integer, and E will be coerced using as.numeric. Missing data are not allowed.

Since the distribution of expected counts, E, tends to be skewed to the right, the largest Es are not squashed by default. This behavior can be changed by setting the keep_pts argument to zero (0); however, this is not recommended. Squashing the largest Es could result in a large loss of information, so it is recommended to use a value of 100 or more for keep_pts.

Values for keep_pts, min_bin, and min_pts should typically be at least as large as the default values.

Value

A data frame with column names N, E, and weight containing the reduced data set.

References

DuMouchel W, Pregibon D (2001). "Empirical Bayes Screening for Multi-item Associations." In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, pp. 67-76. ACM, New York, NY, USA. ISBN 1-58113-391-X.

Examples

set.seed(483726)
dat <- data.frame(
  var1 = letters[1:26], var2 = LETTERS[1:26],
  N = c(rep(0, 11), rep(1, 10), rep(2, 4), rep(3, 1)),
  E = round(abs(c(rnorm(11, 0), rnorm(10, 1), rnorm(4, 2), rnorm(1, 3))), 3),
  stringsAsFactors = FALSE
)
(zeroes <- squashData(dat, count = 0, bin_size = 3, keep_pts = 1,
                      min_bin = 2, min_pts = 2))
(ones <- squashData(zeroes, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))
(twos <- squashData(ones, count = 2, bin_size = 2, keep_pts = 1,
                    min_bin = 2, min_pts = 2))

squashData(zeroes, bin_size = 2, keep_pts = 0,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 1,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 2,
           min_bin = 2, min_pts = 2)
squashData(zeroes, bin_size = 2, keep_pts = 3,
           min_bin = 2, min_pts = 2)

[Package openEBGM version 0.9.1 Index]