accumulate {accumulate}    R Documentation

Split-Apply-Combine with Collapsing Groups

Description

Compute grouped aggregates. If a group does not satisfy certain user-defined conditions (such as too many missing values or too few records), then the group is expanded according to a user-defined 'collapsing' scheme. This happens recursively until either the group satisfies all conditions and the aggregate is computed, or the collapsing possibilities are exhausted and NA is returned for that group.

Usage

accumulate(data, collapse, test, fun, ...)

cumulate(data, collapse, test, ...)

Arguments

data

[data.frame] The data to aggregate by (collapsing) groups.

collapse

[formula|data.frame] representing a group collapsing sequence. See below for details on how to specify each option.

test

[function] A function that takes a subset of data and returns TRUE if it is suitable for computing the desired aggregates and FALSE if a collapsing step is necessary.
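As an illustration of what such a function can look like, here is a minimal sketch in base R. The names make_test, enough_data, d_ok and d_bad are hypothetical and not part of the package; the thresholds are arbitrary assumptions chosen for the example.

```r
# Hypothetical factory for a test function: require at least `n` records
# and at most a fraction `max_na` of missing values in every column.
make_test <- function(n = 3, max_na = 0.5) {
  function(d) {
    enough_rows <- nrow(d) >= n
    # fraction of missing values per column of the subset
    na_fraction <- vapply(d, function(x) mean(is.na(x)), numeric(1))
    enough_rows && all(na_fraction <= max_na)
  }
}

enough_data <- make_test(n = 3, max_na = 0.5)

d_ok  <- data.frame(Y1 = c(1, 2, 4), Y2 = c(NA, 2, 4))
d_bad <- data.frame(Y1 = c(1, 2),    Y2 = c(NA, NA))

enough_data(d_ok)   # TRUE: three records, at most one third missing per column
enough_data(d_bad)  # FALSE: only two records, so a collapsing step is needed
```

A function built this way could be passed as test = enough_data.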

fun

[function] A scalar function that will be applied to all columns of data that are not part of the collapsing scheme.

...

For accumulate, extra arguments to be passed to fun. For cumulate, a comma-separated list of name=expression, where expression defines the aggregating operation.

Value

A data frame where each row represents a (multivariate) group. The first columns contain the grouping variables. The next column is called level and indicates to what level collapsing was necessary to compute a value, where 0 means that no collapsing was necessary. The following columns contain the aggregates defined in the ... argument. If no amount of collapsing yields a data set that is satisfactory according to test, then for that row, the level and subsequent columns are NA.

Using a formula to define the collapsing sequence

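If all combinations of collapsing options are stored as columns in data, the formula interface can be used. An example is the easiest way to see how it works. Suppose that collapse = A*B ~ A1*B + B. This means:

Compute output for groups defined by the variables A and B.

If the data for a combination (a,b) does not pass the test, use the larger group defined by the corresponding combination (a1,b) of A1 and B to compute a value for (a,b).

If that group does not pass the test either, use only B as the collapsing group to compute a value for (a,b).

If that does not pass either, return NA for that combination (a,b).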

Generally, the formula must be of the form X0 ~ X1 + X2 + ... + Xn where each Xi is a (product of) grouping variable(s) in the data set.
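The fallback order implied by a formula X0 ~ X1 + ... + Xn can be sketched in base R for a single group. This is only an illustration under stated assumptions, not the package's implementation: first_passing_level and scheme are hypothetical names, and the data and the formula A*B ~ A*B1 + A are taken from the Examples section.

```r
input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y  = 2^(0:8)
)

# Grouping variables per level for A*B ~ A*B1 + A:
# X0 = A*B, X1 = A*B1, X2 = A
scheme <- list(c("A", "B"), c("A", "B1"), "A")

# Hypothetical helper: find the first level at which the records that share
# the grouping values of `target` (a one-row data frame) pass `test`.
first_passing_level <- function(data, target, scheme, test) {
  for (i in seq_along(scheme)) {
    keep <- rep(TRUE, nrow(data))
    for (v in scheme[[i]]) keep <- keep & data[[v]] == target[[v]]
    d <- data[keep, , drop = FALSE]
    if (isTRUE(test(d))) return(list(level = i - 1L, data = d))
  }
  # no level passed the test
  list(level = NA_integer_, data = NULL)
}

test   <- function(d) nrow(d) >= 3
target <- input[7, ]   # the group A = 3, B = 21 (which has B1 = 2)
res    <- first_passing_level(input, target, scheme, test)

res$level        # 2: neither A*B nor A*B1 gives three records, A does
sum(res$data$Y)  # 448: the sum of Y over all records with A = 3
```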

Using a data frame to define the collapsing scheme

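In this case collapse is a data frame with columns [A0, A1, ..., An]. The variable A0 represents the most fine-grained grouping and must also be present in data. Aggregation works as follows.

Compute output for groups defined by the variable A0.

If the data selected for a certain value a0 of A0 does not pass the test, use the larger data set corresponding to the associated value a1 of A1 to compute the output for a0.

Repeat with A2, ..., An until either the data passes the test or the last collapsing possibility is exhausted. In the latter case, NA is returned for that a0.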
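The lookup through a collapsing data frame can be sketched in base R for a single value of A0. This is a hedged illustration rather than the package's implementation: collapse_one and scheme are hypothetical names, the scheme is assumed to have one row per value of A0, and the data are taken from the Examples section.

```r
input <- data.frame(Y1 = 2^(0:8))
input$A0 <- c(123,123,123,135,136,137,212,213,225)

scheme <- data.frame(
     A0 = c(123, 135, 136, 137, 212, 213, 225)
   , A1 = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
   , A2 = c(1  , 1  , 1  , 1  , 2  , 2  , 2  )
)

# Hypothetical helper: walk through the columns A0, A1, ..., An of `scheme`
# until the records selected for `a0` pass `test`.
collapse_one <- function(data, scheme, a0, test) {
  key <- names(scheme)[1]                      # finest grouping variable
  row <- scheme[scheme[[key]] == a0, , drop = FALSE]
  for (level in seq_along(scheme) - 1L) {
    col <- names(scheme)[level + 1L]
    # all fine-grained codes that share the collapsed code at this level
    members <- scheme[[key]][scheme[[col]] == row[[col]]]
    d <- data[data[[key]] %in% members, , drop = FALSE]
    if (isTRUE(test(d))) return(list(level = level, data = d))
  }
  # collapsing possibilities exhausted
  list(level = NA_integer_, data = NULL)
}

test <- function(d) nrow(d) >= 3
res  <- collapse_one(input, scheme, a0 = 212, test = test)

res$level         # 2: A0 = 212 and A1 = 21 give too few records, A2 = 2 passes
sum(res$data$Y1)  # 448: the sum of Y1 over A0 in 212, 213, 225
```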

Examples


## Example of data frame defining collapsing scheme, using accumulate

input    <- data.frame(Y1 = 2^(0:8), Y2 = 2^(0:8))
input$Y2[c(1,4,7)] <- NA
# make sure that the input data also has the most fine-grained (target)
# grouping variable
input$A0 <- c(123,123,123,135,136,137,212,213,225)

# define collapsing sequence
collapse <- data.frame(
     A0   = c(123, 135, 136, 137, 212, 213, 225)
   , A1   = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
   , A2   = c(1  , 1  , 1  , 1  , 2  , 2  , 2  )
)

accumulate(input
 , collapse
 , test = function(d) nrow(d)>=3
 , fun  = sum, na.rm=TRUE)


## Example of formula defining collapsing scheme, using cumulate
input <- data.frame(
   A  = c(1,1,1,2,2,2,3,3,3)
 , B  = c(11,11,11,12,12,13,21,22,12)
 , B1 = c(1,1,1,1,1,1,2,2,1)
 , Y  = 2^(0:8)
)
cumulate(input, collapse=A*B ~ A*B1 + A
        , test = function(d) nrow(d) >= 3
        , tY = sum(Y))


## Example with formula defining collapsing scheme, using accumulate
# The collapsing scheme must be represented by variables in the 
# data. All columns not part of the collapsing scheme will be aggregated
# over.

input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y1 = 2^(0:8)
  , Y2 = 2^(0:8)
)

input$Y2[c(1,4,7)] <- NA

accumulate(input
 , collapse = A*B ~ A*B1 + A
 , test=function(a) nrow(a)>=3
 , fun = sum, na.rm=TRUE)



## Example with data.frame defining collapsing scheme, using cumulate
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8))
# collapsing scheme
csh <- data.frame(
   A0 = c("11","12","22")
 , A1 = c("1" ,"1", "2") 
)
cumulate(data = dat
   , collapse = csh
   , test     = function(d) nrow(d) >= 2
   , mn = mean(Y, na.rm=TRUE)
   , md = median(Y, na.rm=TRUE)
)


[Package accumulate version 0.9.3]