| accumulate {accumulate} | R Documentation |
Split-Apply-Combine with Collapsing Groups
Description
Compute grouped aggregates. If a group does not satisfy certain user-defined
conditions (such as too many missings, or not enough records) then the group
is expanded according to a user-defined 'collapsing' scheme. This happens
recursively until either the group satisfies all conditions and the
aggregate is computed, or the collapsing possibilities run out and
NA is returned for that group.
accumulate aggregates over all non-grouping variables defined in collapse.
cumulate uses a syntax akin to dplyr::summarise.
Usage
accumulate(data, collapse, test, fun, ...)
cumulate(data, collapse, test, ...)
Arguments
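data |
A data frame with the records to aggregate. |
collapse |
A formula or a data frame that defines the collapsing sequence. Both forms are described in the sections below. |
test |
A function that takes a subset of data and returns TRUE if that subset is suitable for computing the aggregates, and FALSE if a further collapsing step is needed. |
fun |
A function that is applied to every non-grouping column of data (accumulate only). |
... |
For accumulate, further arguments passed on to fun. For cumulate, name = expression pairs that define the aggregates, as in dplyr::summarise. |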
Value
A data frame where each row represents a (multivariate) group. The first
columns contain the grouping variables. The next column is called
level and indicates to what level collapsing was necessary to compute
a value, where 0 means that no collapsing was necessary. The following
columns contain the aggregates defined in the ... argument. If no
amount of collapsing yields a data set that is satisfactory according to
test, then for that row, the level and subsequent columns are
NA.
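The layout described above can be inspected with ordinary data frame indexing. The sketch below uses a hand-written stand-in for a result, not captured package output: its values were worked out by hand from the collapsing rules on this page for the last example in the Examples section, and the md column is left out for brevity.

```r
# Hand-written stand-in with the documented layout: grouping column first,
# then `level`, then the aggregates. Not captured output of the package.
out <- data.frame(
  A0    = c("11", "12", "22"),
  level = c(0L, 1L, NA),   # 0: no collapsing, 1: one step, NA: test never passed
  mn    = c(4, 4, NA)
)

# Groups whose value was computed on a collapsed (wider) group.
collapsed <- out[!is.na(out$level) & out$level > 0, ]

# Groups for which no amount of collapsing satisfied `test`.
failed <- out[is.na(out$level), ]
```

Checking level before using the aggregates is a cheap way to see how many values rest on a wider group than the one requested.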
Using a formula to define the collapsing sequence
If all combinations of collapsing options are stored as columns in
data, the formula interface can be used. An example is the
easiest way to see how it works. Suppose that collapse = A*B ~ A1*B + B.
This means:
Compute output for groups defined by variables A and B.
If for a certain combination (a,b) in AxB the data does not pass the test, use (a1,b) in A1xB as alternative combination to compute a value for (a,b) (A1xB must yield larger groups than AxB).
If that does not work, use only B as a grouping variable to compute a value for (a,b).
If that does not work, return NA for that particular combination (a,b).
Generally, the formula must be of the form X0 ~ X1 + X2 + ... +
Xn where each Xi is a (product of) grouping variable(s) in the data set.
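As a rough illustration of how a formula of this form can be read, the base R sketch below splits it into one set of grouping variables per collapsing level. The helper collapse_levels is hypothetical and written for this page. It is not part of the accumulate package and is not meant to show how the package parses formulas internally.

```r
# Hypothetical helper (not part of the accumulate package): turn a collapsing
# formula X0 ~ X1 + ... + Xn into a list of grouping-variable names per level.
collapse_levels <- function(formula) {
  # Flatten a sum `X1 + X2 + ...` into a list of its terms, in order.
  split_sum <- function(e) {
    if (is.call(e) && identical(e[[1]], as.name("+"))) {
      c(split_sum(e[[2]]), list(e[[3]]))
    } else {
      list(e)
    }
  }
  lhs <- formula[[2]]              # X0: the target grouping
  rhs <- split_sum(formula[[3]])   # X1, ..., Xn: fallbacks, tried in order
  out <- lapply(c(list(lhs), rhs), all.vars)
  names(out) <- seq_along(out) - 1L   # level 0 means no collapsing
  out
}

lv <- collapse_levels(A*B ~ A1*B + B)
# lv[["0"]] is c("A", "B"), lv[["1"]] is c("A1", "B"), lv[["2"]] is "B"
```

Each element names the columns that define the groups at that level, so a higher level should describe groups that are at least as large as those of the level before it.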
Using a data frame to define the collapsing scheme
In this case collapse is a data frame with columns [A0, A1,
..., An]. The variable A0 represents the most fine-grained
grouping and must also be present in data. Aggregation works
as follows.
Compute output for groups defined by variable A0.
If for a certain a0 in A0 the corresponding selected data does not pass the test, use the larger dataset corresponding to a1 in A1 to compute output for a0.
Repeat the second step until either the test is passed or no more collapsing is possible. In the latter case, return NA for that particular value of a0.
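The steps above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the package's implementation: the helper collapse_one is hypothetical, it aggregates a single column named by y, and it assumes the first column of the collapsing table is called A0 and holds one row per fine-grained code. The data are a reduced version of the first example in the Examples section.

```r
# Hypothetical helper (not the package's code): compute fun() on column `y`
# for one target group `a0`, widening the group along the columns of `scheme`
# until `test` accepts the selected rows.
collapse_one <- function(a0, data, scheme, test, fun, y) {
  row <- scheme[scheme$A0 == a0, , drop = FALSE]
  for (level in seq_along(scheme) - 1L) {
    col <- names(scheme)[level + 1L]
    # All fine-grained codes that share a0's code at the current level.
    members <- scheme$A0[scheme[[col]] == row[[col]]]
    d <- data[data$A0 %in% members, , drop = FALSE]
    if (test(d)) {
      return(data.frame(A0 = a0, level = level, value = fun(d[[y]])))
    }
  }
  # No level passed the test.
  data.frame(A0 = a0, level = NA_integer_, value = NA_real_)
}

input <- data.frame(
  A0 = c(123, 123, 123, 135, 136, 137, 212, 213, 225),
  Y1 = 2^(0:8)
)
scheme <- data.frame(
  A0 = c(123, 135, 136, 137, 212, 213, 225),
  A1 = c(12, 13, 13, 13, 21, 21, 22),
  A2 = c(1, 1, 1, 1, 2, 2, 2)
)

out <- do.call(rbind, lapply(scheme$A0, function(a0) {
  collapse_one(a0, input, scheme,
               test = function(d) nrow(d) >= 3, fun = sum, y = "Y1")
}))
# out$level: 0 1 1 1 2 2 2
# out$value: 7 56 56 56 448 448 448
```

Group 123 already has three records, so it is computed at level 0. Groups 135 to 137 pass only after collapsing to A1 = 13, and groups 212, 213 and 225 need the coarsest level, A2 = 2.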
Examples
## Example of data frame defining collapsing scheme, using accumulate
input <- data.frame(Y1 = 2^(0:8), Y2 = 2^(0:8))
input$Y2[c(1,4,7)] <- NA
# make sure that the input data also has the most fine-grained (target)
# grouping variable
input$A0 <- c(123,123,123,135,136,137,212,213,225)
# define collapsing sequence
collapse <- data.frame(
A0 = c(123, 135, 136, 137, 212, 213, 225)
, A1 = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
, A2 = c(1 , 1 , 1 , 1 , 2 , 2 , 2 )
)
accumulate(input
, collapse
, test = function(d) nrow(d)>=3
, fun = sum, na.rm=TRUE)
## Example of formula defining collapsing scheme, using cumulate
input <- data.frame(
A = c(1,1,1,2,2,2,3,3,3)
, B = c(11,11,11,12,12,13,21,22,12)
, B1 = c(1,1,1,1,1,1,2,2,1)
, Y = 2^(0:8)
)
cumulate(input, collapse=A*B ~ A*B1 + A
, test = function(d) nrow(d) >= 3
, tY = sum(Y))
## Example with formula defining collapsing scheme, using accumulate
# The collapsing scheme must be represented by variables in the
# data. All columns not part of the collapsing scheme will be aggregated
# over.
input <- data.frame(
A = c(1,1,1,2,2,2,3,3,3)
, B = c(11,11,11,12,12,13,21,22,12)
, B1 = c(1,1,1,1,1,1,2,2,1)
, Y1 = 2^(0:8)
, Y2 = 2^(0:8)
)
input$Y2[c(1,4,7)] <- NA
accumulate(input
, collapse = A*B ~ A*B1 + A
, test=function(a) nrow(a)>=3
, fun = sum, na.rm=TRUE)
## Example with data.frame defining collapsing scheme, using cumulate
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8))
# collapsing scheme
csh <- data.frame(
A0 = c("11","12","22")
, A1 = c("1" ,"1", "2")
)
cumulate(data = dat
, collapse = csh
, test = function(d) nrow(d) >= 2
, mn = mean(Y, na.rm=TRUE)
, md = median(Y, na.rm=TRUE)
)