R: Split-Apply-Combine with Collapsing Groups

accumulate {accumulate}

R Documentation

Split-Apply-Combine with Collapsing Groups

Description

Compute grouped aggregates. If a group does not satisfy certain user-defined conditions (such as too many missing values, or not enough records), then the group is expanded according to a user-defined 'collapsing' scheme. This happens recursively until either the group satisfies all conditions and the aggregate is computed, or there are no collapsing options left and NA is returned for that group.

accumulate aggregates over all variables in data that are not grouping variables defined in collapse.
cumulate uses a syntax akin to dplyr::summarise.

Usage

accumulate(data, collapse, test, fun, ...)

cumulate(data, collapse, test, ...)

Arguments

`data`	`[data.frame]` The data to aggregate by (collapsing) groups.
`collapse`	`[formula\|data.frame]` representing a group collapsing sequence. See below for details on how to specify each option.
`test`	`[function]` A function that takes a subset of `data` and returns `TRUE` if it is suitable for computing the desired aggregates and `FALSE` if a collapsing step is necessary.
`fun`	`[function]` A scalar function that will be applied to all columns of `data` that are not part of the collapsing scheme.
`...`	For `accumulate`, extra arguments to be passed to `fun`. For `cumulate`, a comma-separated list of `name=expression`, where `expression` defines the aggregating operation.
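
As an illustration of the `test` argument, the following is a minimal base R sketch of a function that combines the two conditions mentioned in the Description (enough records, not too many missing values). The name `enough_complete`, the column name `Y2`, and the thresholds are hypothetical and not part of the package. Every extra argument has a default, so the function can be called with the data subset alone.

```r
# Hypothetical test function: require at least `min_rows` records and
# at most a fraction `max_na` of missing values in column `var`.
enough_complete <- function(d, var = "Y2", min_rows = 3, max_na = 0.5) {
  if (nrow(d) < min_rows) return(FALSE)
  mean(is.na(d[[var]])) <= max_na
}

d_ok  <- data.frame(Y2 = c(1, NA, 4))   # 3 rows, 1/3 missing
d_bad <- data.frame(Y2 = c(NA, NA, 4))  # 3 rows, 2/3 missing

enough_complete(d_ok)   # TRUE
enough_complete(d_bad)  # FALSE
```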

Value

A data frame where each row represents a (multivariate) group. The first columns contain the grouping variables. The next column is called level and indicates to what level collapsing was necessary to compute a value, where 0 means that no collapsing was necessary. The following columns contain the aggregates defined in the ... argument. If no amount of collapsing yields a data set that is satisfactory according to test, then for that row, the level and subsequent columns are NA.

Using a formula to define the collapsing sequence

If all combinations of collapsing options are stored as columns in data, the formula interface can be used. An example is the easiest way to see how it works. Suppose that collapse = A*B ~ A1*B + B. This means:

Compute output for groups defined by variables A and B.
If for a certain combination (a,b) in AxB the data does not pass the test, use (a1,b) in A1xB as alternative combination to compute a value for (a,b) (A1xB must yield larger groups than AxB).
If that does not work, use only B as a grouping variable to compute a value for (a,b).
If that does not work, return NA for that particular combination (a,b).

Generally, the formula must be of the form X0 ~ X1 + X2 + ... + Xn where each Xi is a (product of) grouping variable(s) in the data set.
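
The steps above can be sketched in base R. This is only an illustration of the collapsing logic, not the package's implementation. It uses the data from the Examples section with the scheme A*B ~ A*B1 + A, and the helper names `levels_by`, `rec`, and `keep` are hypothetical.

```r
input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y  = 2^(0:8)
)

# Grouping variables per level, mirroring A*B ~ A*B1 + A
# (element 1 is level 0, element 2 is level 1, element 3 is level 2).
levels_by <- list(c("A", "B"), c("A", "B1"), "A")
test <- function(d) nrow(d) >= 3

# One output row per distinct (A, B) combination.
out <- unique(input[c("A", "B")])
out$level <- NA_integer_
out$tY    <- NA_real_

for (i in seq_len(nrow(out))) {
  # A representative record of the target group, used to look up
  # which coarser group it belongs to.
  rec <- input[input$A == out$A[i] & input$B == out$B[i], ][1, ]
  for (k in seq_along(levels_by)) {
    keep <- rep(TRUE, nrow(input))
    for (v in levels_by[[k]]) keep <- keep & input[[v]] == rec[[v]]
    d <- input[keep, ]
    if (test(d)) {            # first level that passes the test wins
      out$level[i] <- k - 1L
      out$tY[i]    <- sum(d$Y)
      break
    }
  }
}
out
```

In this sketch, the group (1, 11) passes at level 0, the groups with A = 2 need one collapsing step, and the groups with A = 3 are only resolved at level 2, where A alone is the grouping variable.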

Using a data frame to define the collapsing scheme

In this case collapse is a data frame with columns [A0, A1, ..., An]. The variable A0 represents the most fine-grained grouping and must also be present in data. Aggregation works as follows.

Compute output for groups defined by variable A0.
If for a certain a0 in A0 the corresponding selected data does not pass the test, use the larger dataset corresponding to a1 in A1 to compute output for a0.
Repeat the second step until either the test is passed or no more collapsing is possible. In the latter case, return NA for that particular value of a0.
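
The procedure above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: it assumes every value of A0 in the collapsing scheme occurs in the data, and the names `full`, `out`, and `mn` are hypothetical. The data and scheme are taken from the Examples section.

```r
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8),
                  stringsAsFactors = FALSE)
csh <- data.frame(A0 = c("11","12","22"), A1 = c("1","1","2"),
                  stringsAsFactors = FALSE)
test <- function(d) nrow(d) >= 2

# Attach the coarser codes (A1, ...) to every record.
full <- merge(dat, csh, by = "A0")

out <- data.frame(A0 = csh$A0, level = NA_integer_, mn = NA_real_,
                  stringsAsFactors = FALSE)

for (i in seq_len(nrow(csh))) {
  for (k in seq_along(csh)) {      # k = 1 is A0, k = 2 is A1, ...
    col <- names(csh)[k]
    d <- full[full[[col]] == csh[[col]][i], ]
    if (test(d)) {                 # stop at the first level that passes
      out$level[i] <- k - 1L
      out$mn[i]    <- mean(d$Y, na.rm = TRUE)
      break
    }
  }
}
out
```

Here "11" passes at level 0, "12" needs one collapsing step to the group with A1 equal to "1", and "22" never reaches two records, so its level and aggregate stay NA.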

Examples


## Example of data frame defining collapsing scheme, using accumulate

input    <- data.frame(Y1 = 2^(0:8), Y2 = 2^(0:8))
input$Y2[c(1,4,7)] <- NA
# make sure that the input data also has the most fine-grained (target)
# grouping variable
input$A0 <- c(123,123,123,135,136,137,212,213,225)

# define collapsing sequence
collapse <- data.frame(
     A0   = c(123, 135, 136, 137, 212, 213, 225)
   , A1   = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
   , A2   = c(1  , 1  , 1  , 1  , 2  , 2  , 2  )
)

accumulate(input
 , collapse
 , test = function(d) nrow(d)>=3
 , fun  = sum, na.rm=TRUE)


## Example of formula defining collapsing scheme, using cumulate
input <- data.frame(
   A  = c(1,1,1,2,2,2,3,3,3)
 , B  = c(11,11,11,12,12,13,21,22,12)
 , B1 = c(1,1,1,1,1,1,2,2,1)
 , Y  = 2^(0:8)
)
cumulate(input, collapse=A*B ~ A*B1 + A
        , test = function(d) nrow(d) >= 3
        , tY = sum(Y))


## Example with formula defining collapsing scheme, using accumulate
# The collapsing scheme must be represented by variables in the 
# data. All columns not part of the collapsing scheme will be aggregated
# over.

input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y1 = 2^(0:8)
  , Y2 = 2^(0:8)
)

input$Y2[c(1,4,7)] <- NA

accumulate(input
 , collapse = A*B ~ A*B1 + A
 , test=function(a) nrow(a)>=3
 , fun = sum, na.rm=TRUE)



## Example with data.frame defining collapsing scheme, using cumulate
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8))
# collapsing scheme
csh <- data.frame(
   A0 = c("11","12","22")
 , A1 = c("1" ,"1", "2") 
)
cumulate(data = dat
   , collapse = csh
   , test     = function(d) nrow(d) >= 2
   , mn = mean(Y, na.rm=TRUE)
   , md = median(Y, na.rm=TRUE)
)

[Package accumulate version 0.9.3 Index]