accumulate {accumulate}    R Documentation
Split-Apply-Combine with Collapsing Groups
Description
Compute grouped aggregates. If a group does not satisfy certain user-defined
conditions (such as too many missings, or not enough records) then the group
is expanded according to a user-defined 'collapsing' scheme. This happens
recursively until either the group satisfies all conditions and the
aggregate is computed, or we run out of collapsing possibilities and NA is
returned for that group.
- accumulate aggregates over all non-grouping variables defined in collapse.
- cumulate uses a syntax akin to dplyr::summarise.
Usage
accumulate(data, collapse, test, fun, ...)
cumulate(data, collapse, test, ...)
Arguments
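data | A data frame holding the records to aggregate.
collapse | A formula or a data frame that defines the collapsing scheme. See the sections below for both options.
test | A function that takes the data frame of records in a (possibly collapsed) group and returns TRUE if the group is suitable for computing the aggregates, and FALSE otherwise.
fun | The aggregating function. accumulate applies it to every column of data that is not part of the collapsing scheme.
... | For accumulate, further arguments passed to fun (for example na.rm=TRUE). For cumulate, named aggregating expressions of the form name = expression.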
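The Description mentions conditions such as too many missings or not enough records. As a minimal sketch in base R, such conditions could be written as functions that take the data frame of a candidate group and return a single TRUE or FALSE. The function names and the column name Y below are hypothetical and only serve as illustration.

```r
# Hypothetical test functions; the names are illustrative, not part of the package.
# Each takes the data frame of records in a candidate group and returns TRUE or FALSE.

# Accept a group only if it has at least 3 records.
enough_records <- function(d) nrow(d) >= 3

# Accept a group only if at most half of the values in column Y are missing.
few_missings <- function(d) mean(is.na(d$Y)) <= 0.5

# Combine both conditions into a single test.
both <- function(d) enough_records(d) && few_missings(d)

d <- data.frame(Y = c(1, NA, 3, NA))
enough_records(d)              # TRUE: 4 records
few_missings(d)                # TRUE: exactly half of Y is missing
both(d[1:2, , drop = FALSE])   # FALSE: only 2 records
```

A function like both could then be supplied as the test argument, in the same way as function(d) nrow(d) >= 3 in the Examples.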
Value
A data frame where each row represents a (multivariate) group. The first
columns contain the grouping variables. The next column is called level
and indicates to what level collapsing was necessary to compute
a value, where 0 means that no collapsing was necessary. The following
columns contain the aggregates defined in the ... argument. If no
amount of collapsing yields a data set that is satisfactory according to
test, then for that row, the level and subsequent columns are NA.
Using a formula to define the collapsing sequence
If all combinations of collapsing options are stored as columns in data,
the formula interface can be used. An example is the easiest way to see
how it works. Suppose that collapse = A*B ~ A1*B + B. This means:
- Compute output for groups defined by variables A and B.
- If for a certain combination (a,b) in AxB the data does not pass the test, use (a1,b) in A1xB as alternative combination to compute a value for (a,b) (A1xB must yield larger groups than AxB).
- If that does not work, use only B as a grouping variable to compute a value for (a,b).
- If that does not work, return NA for that particular combination (a,b).
Generally, the formula must be of the form X0 ~ X1 + X2 + ... + Xn,
where each Xi is a (product of) grouping variable(s) in the data set.
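The fallback logic described in this section can be sketched in base R. This is an illustration only, not the package's implementation. It uses the formula A*B ~ A*B1 + A and the data from the Examples section; the names schemes, targets and collapse_one are hypothetical helpers introduced here.

```r
input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y  = 2^(0:8)
)
test <- function(d) nrow(d) >= 3

# Grouping variables per collapsing level, mirroring A*B ~ A*B1 + A.
schemes <- list(c("A", "B"), c("A", "B1"), "A")

# One target per combination (a,b) occurring in the data; B1 is carried along
# so that the coarser group of each target can be looked up.
targets <- input[!duplicated(input[c("A", "B")]), c("A", "B", "B1")]

collapse_one <- function(target) {
  for (i in seq_along(schemes)) {
    # Select all records that agree with the target on the current variables.
    keep <- rep(TRUE, nrow(input))
    for (v in schemes[[i]]) keep <- keep & input[[v]] == target[[v]]
    d <- input[keep, , drop = FALSE]
    # The first level that passes the test yields the aggregate.
    if (test(d)) return(data.frame(level = i - 1, tY = sum(d$Y)))
  }
  # No level passed the test.
  data.frame(level = NA, tY = NA)
}

res <- do.call(rbind, lapply(seq_len(nrow(targets)),
                             function(i) collapse_one(targets[i, ])))
out <- cbind(targets[c("A", "B")], res)
out
# level: 0 1 1 2 2 2;  tY: 7 56 56 448 448 448
```

Here only the group (1,11) has three records of its own; the groups with A equal to 2 need one collapsing step, and those with A equal to 3 need two.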
Using a data frame to define the collapsing scheme
In this case collapse is a data frame with columns [A0, A1, ..., An].
The variable A0 represents the most fine-grained grouping and must also
be present in data. Aggregation works as follows.
- Compute output for groups defined by variable A0.
- If for a certain a0 in A0 the corresponding selected data does not pass the test, use the larger dataset corresponding to a1 in A1 to compute output for a0.
- Repeat the second step until either the test is passed or no more collapsing is possible. In the latter case, return NA for that particular value of a0.
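The steps above can be sketched in base R as follows. This is a simplified illustration under the assumption that each A0 code occurs exactly once in the collapsing table; it is not the package's implementation. The data are taken from the Examples section (restricted to Y1), and full and collapse_one are hypothetical names.

```r
input <- data.frame(
    A0 = c(123,123,123,135,136,137,212,213,225)
  , Y1 = 2^(0:8)
)
collapse <- data.frame(
    A0 = c(123, 135, 136, 137, 212, 213, 225)
  , A1 = c(12, 13, 13, 13, 21, 21, 22)
  , A2 = c(1, 1, 1, 1, 2, 2, 2)
)
test <- function(d) nrow(d) >= 3

# Attach the coarser codes A1 and A2 to every record via the shared column A0.
full <- merge(input, collapse, by = "A0")

collapse_one <- function(a0) {
  codes <- collapse[collapse$A0 == a0, ]
  for (i in seq_along(collapse)) {
    col <- names(collapse)[i]
    # All records that share the code of a0 at collapsing level i - 1.
    d <- full[full[[col]] == codes[[col]], , drop = FALSE]
    if (test(d)) return(data.frame(A0 = a0, level = i - 1, Y1 = sum(d$Y1)))
  }
  # No more collapsing possible.
  data.frame(A0 = a0, level = NA, Y1 = NA)
}

out <- do.call(rbind, lapply(collapse$A0, collapse_one))
out
# level: 0 1 1 1 2 2 2;  Y1: 7 56 56 56 448 448 448
```

Note that the result is keyed on a0 throughout: the records of the larger group are borrowed only to compute a value for the original fine-grained code.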
Examples
## Example of data frame defining collapsing scheme, using accumulate
input <- data.frame(Y1 = 2^(0:8), Y2 = 2^(0:8))
input$Y2[c(1,4,7)] <- NA
# make sure that the input data also has the most fine-grained (target)
# grouping variable
input$A0 <- c(123,123,123,135,136,137,212,213,225)
# define collapsing sequence
collapse <- data.frame(
    A0 = c(123, 135, 136, 137, 212, 213, 225)
  , A1 = c(12 , 13 , 13 , 13 , 21 , 21 , 22 )
  , A2 = c(1  , 1  , 1  , 1  , 2  , 2  , 2  )
)
accumulate(input
  , collapse
  , test = function(d) nrow(d) >= 3
  , fun  = sum, na.rm = TRUE)
## Example of formula defining collapsing scheme, using cumulate
input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y  = 2^(0:8)
)
cumulate(input, collapse = A*B ~ A*B1 + A
  , test = function(d) nrow(d) >= 3
  , tY = sum(Y))
## Example with formula defining collapsing scheme, using accumulate
# The collapsing scheme must be represented by variables in the
# data. All columns not part of the collapsing scheme will be aggregated
# over.
input <- data.frame(
    A  = c(1,1,1,2,2,2,3,3,3)
  , B  = c(11,11,11,12,12,13,21,22,12)
  , B1 = c(1,1,1,1,1,1,2,2,1)
  , Y1 = 2^(0:8)
  , Y2 = 2^(0:8)
)
input$Y2[c(1,4,7)] <- NA
accumulate(input
  , collapse = A*B ~ A*B1 + A
  , test = function(a) nrow(a) >= 3
  , fun  = sum, na.rm = TRUE)
## Example with data.frame defining collapsing scheme, using cumulate
dat <- data.frame(A0 = c("11","12","11","22"), Y = c(2,4,6,8))
# collapsing scheme
csh <- data.frame(
    A0 = c("11","12","22")
  , A1 = c("1" ,"1" ,"2")
)
cumulate(data = dat
  , collapse = csh
  , test = function(d) nrow(d) >= 2
  , mn = mean(Y, na.rm = TRUE)
  , md = median(Y, na.rm = TRUE)
)