GRP {collapse} | R Documentation |
Fast Grouping / collapse Grouping Objects
Description
GRP performs fast, ordered and unordered, groupings of vectors and data frames (or lists of vectors) using radixorderv or group. The output is a list-like object of class 'GRP' which can be printed, plotted and used as an efficient input to all of collapse's fast statistical and transformation functions and operators (see macros .FAST_FUN and .OPERATOR_FUN), as well as to collap, BY and TRA.
fgroup_by is similar to dplyr::group_by but faster and class-agnostic. It creates a grouped data frame with a 'GRP' object attached, for fast dplyr-like programming with collapse's fast functions.
There are also several conversion methods to and from 'GRP' objects. Notable among these is GRP.grouped_df, which returns a 'GRP' object from a grouped data frame created with dplyr::group_by or fgroup_by, and the duo GRP.factor and as_factor_GRP.
gsplit efficiently splits a vector based on a 'GRP' object, and greorder helps to recombine the results. These are the workhorses behind BY, and behind collap, fsummarise and fmutate when these are evaluated with base R and user-defined functions.
Usage
GRP(X, ...)
## Default S3 method:
GRP(X, by = NULL, sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto",
call = TRUE, ...)
## S3 method for class 'factor'
GRP(X, ..., group.sizes = TRUE, drop = FALSE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'qG'
GRP(X, ..., group.sizes = TRUE, return.groups = TRUE, call = TRUE)
## S3 method for class 'pseries'
GRP(X, effect = 1L, ..., group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'pdata.frame'
GRP(X, effect = 1L, ..., group.sizes = TRUE, return.groups = TRUE,
call = TRUE)
## S3 method for class 'grouped_df'
GRP(X, ..., return.groups = TRUE, call = TRUE)
# Identify 'GRP' objects
is_GRP(x)
## S3 method for class 'GRP'
length(x) # Length of data being grouped
GRPN(x, expand = TRUE, ...) # Group sizes (default: expanded to match data length)
GRPid(x, sort = FALSE, ...) # Group id (data length, same as GRP(.)$group.id)
GRPnames(x, force.char = TRUE, sep = ".") # Group names
as_factor_GRP(x, ordered = FALSE, sep = ".") # 'GRP'-object to (ordered) factor conversion
# Efficiently split a vector using a 'GRP' object
gsplit(x, g, use.g.names = FALSE, ...)
# Efficiently reorder y = unlist(gsplit(x, g)) such that identical(greorder(y, g), x)
greorder(x, g, ...)
# Fast, class-agnostic counterpart to dplyr::group_by for use with fast functions, see details
fgroup_by(.X, ..., sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto")
# Standard-evaluation analogue (slim wrapper around GRP.default(), for programming)
group_by_vars(X, by = NULL, ...)
# Shorthand for fgroup_by
gby(.X, ..., sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE,
return.groups = TRUE, return.order = sort, method = "auto")
# Get grouping columns from a grouped data frame created with dplyr::group_by or fgroup_by
fgroup_vars(X, return = "data")
# Ungroup grouped data frame created with dplyr::group_by or fgroup_by
fungroup(X, ...)
## S3 method for class 'GRP'
print(x, n = 6, ...)
## S3 method for class 'GRP'
plot(x, breaks = "auto", type = "l", horizontal = FALSE, ...)
Arguments
X | a vector, list of columns or data frame (default method), or a suitable object (conversion / extractor methods).
.X | a data frame or list.
x, g | a 'GRP' object. For
by | if
sort | logical. If
ordered | logical.
decreasing | logical. Should the sort order be increasing or decreasing? Can be a vector of length equal to the number of arguments in
na.last | logical. If missing values are encountered in grouping vector/columns, assign them to the last group (argument passed to
return.groups | logical. Include the unique groups in the created GRP object.
return.order | logical. If
method | character. The algorithm to use for grouping: either
group.sizes | logical.
drop | logical.
call | logical.
expand | logical.
force.char | logical. Always output group names as character vector, even if a single numeric vector was passed to
sep | character. The separator passed to
effect | plm / indexed data methods: Select which panel identifier should be used as grouping variable. 1L takes the first variable in the index, 2L the second etc., identifiers can also be passed as a character string. More than one variable can be supplied.
return | an integer or string specifying what
use.g.names | logical.
n | integer. Number of groups to print out.
breaks | integer. Number of breaks in the histogram of group-sizes.
type | linetype for plot.
horizontal | logical.
... | for
Details
GRP is a central function in the collapse package because it provides, in the form of integer vectors, some key pieces of information to efficiently perform grouped operations at the C/C++ level.
Most statistical functions require information about (1) the number of groups, (2) an integer group-id indicating which values / rows belong to which group, and (3) the size of each group. Provided with these, collapse's Fast Statistical Functions pre-allocate intermediate and result vectors of the right sizes and (in most cases) perform grouped statistical computations in a single pass through the data.
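As a rough illustration of why these three pieces of information suffice, the following base R sketch computes a grouped sum and mean in a single pass. The vectors x, id and ng are made-up inputs, and the loop only mimics the idea; it is not collapse's actual C/C++ code.

```r
# Hypothetical inputs: values, integer group-id, and number of groups
x  <- c(2, 4, 6, 8, 10)
id <- c(1L, 2L, 1L, 3L, 2L)
ng <- 3L

# (3) group sizes, here derived from the group-id
sizes <- tabulate(id, ng)

# Pre-allocate the result using (1), then accumulate in one pass using (2)
out <- numeric(ng)
for (i in seq_along(x)) {
  out[id[i]] <- out[id[i]] + x[i]
}

out          # grouped sums:  8 14 8
out / sizes  # grouped means: 4 7 8
```

Because the result vector has exactly one slot per group, no intermediate splitting or copying of the data is needed.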
The sorting functionality of GRP.default lets groups receive different integer ids depending on whether the groups are sorted (sort = TRUE; sort = FALSE gives first-appearance order), and in which order (argument decreasing). This affects the order of values/rows in the output whenever an aggregation is performed.
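A small sketch of this behaviour, assuming collapse is installed and using a made-up character vector x:

```r
library(collapse)

x <- c("b", "a", "b", "c")   # hypothetical input

id_sorted <- GRP(x, sort = TRUE)$group.id                     # ids follow a < b < c
id_desc   <- GRP(x, sort = TRUE, decreasing = TRUE)$group.id  # ids follow c > b > a
id_first  <- GRP(x, sort = FALSE)$group.id                    # ids follow first appearance

id_sorted  # 2 1 2 3
id_desc    # 2 3 2 1
id_first   # 1 2 1 3
```

The grouping itself is the same in all three cases; only the numbering of the groups, and hence the row order of aggregated output, differs.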
Other elements in the object provide information about whether the data was sorted by the variables defining the grouping (6) and the ordering vector (7). These also feed into optimizations in gsplit/greorder that benefit the execution of base R functions across groups.
Complementary to GRP, the function fgroup_by is a significantly faster and class-agnostic alternative to dplyr::group_by for programming with collapse. It creates a grouped data frame with a 'GRP' object attached in a "groups" attribute. This data frame has classes 'GRP_df', ..., 'grouped_df' and 'data.frame', where ... stands for any other classes the input frame inherits such as 'data.table', 'sf', 'tbl_df', 'indexed_frame' etc. collapse functions with a 'grouped_df' method respond to 'grouped_df' objects created with either fgroup_by or dplyr::group_by. The method GRP.grouped_df takes the "groups" attribute from a 'grouped_df' and, if the frame was created with dplyr::group_by, converts it to a 'GRP' object.
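The structure described above can be inspected directly. The following is a minimal sketch, assuming collapse is installed; the object names gdf and g are chosen only for illustration.

```r
library(collapse)

gdf <- fgroup_by(mtcars, cyl, vs)

class(gdf)                 # includes "GRP_df", "grouped_df" and "data.frame"
g <- attr(gdf, "groups")   # the grouping object stored in the "groups" attribute
is_GRP(g)                  # TRUE
g$group.vars               # "cyl" "vs"

# GRP() applied to the grouped frame yields the same group ids
same_ids <- identical(GRP(gdf)$group.id, g$group.id)
```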
The 'GRP_df' class in front responds to print.GRP_df, which first calls print(fungroup(x), ...) and then prints one line below the object indicating the grouping variables, followed, in square brackets, by some statistics on the group sizes: [N | Mean (SD) Min-Max]. The mean is rounded to a whole number and the standard deviation (SD) to one digit. Minimum and maximum are only displayed if the SD is non-zero. There also exists a method [.GRP_df which calls NextMethod but makes sure that the grouping information is preserved or dropped depending on the dimensions of the result (subsetting rows or aggregation with data.table drops the grouping object).
GRP.default supports vector and list input and will also return 'GRP' objects if passed. There is also a hidden method GRP.GRP which simply returns grouping objects (no re-grouping functionality is offered).
Apart from GRP.grouped_df there are several further conversion methods:
The conversion of factors to 'GRP' objects by GRP.factor involves obtaining the number of groups by calling ng <- fnlevels(f) and then computing the count of each level using tabulate(f, ng). The integer group-id (2) is already given by the factor itself after removing the levels and class attributes and replacing any missing values with ng + 1L. The levels are put in a list and moved to position (4) in the 'GRP' object, which is reserved for the unique groups. Finally, a sortedness check !is.unsorted(id) is run on the group-id to check if the data represented by the factor was sorted (6). GRP.qG works similarly (see also qG), and the 'pseries' and 'pdata.frame' methods simply group one or more factors in the index (selected using the effect argument).
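The steps described for GRP.factor can be retraced in base R. This is only a sketch of the description above, not the package's source: nlevels stands in for fnlevels, and the factor f is a made-up example.

```r
f <- factor(c("a", "b", "b", NA, "c"))   # hypothetical input with a missing value

ng     <- nlevels(f)          # number of levels (base R stand-in for fnlevels)
sizes  <- tabulate(f, ng)     # count of each level; NA is not counted by tabulate
id     <- as.integer(f)       # integer codes, i.e. the factor without levels and class
id[is.na(id)] <- ng + 1L      # missing values receive the id ng + 1
groups <- list(levels(f))     # levels stored in a list as the unique groups
sorted <- !is.unsorted(id)    # sortedness check on the group-id

id      # 1 2 2 4 3
sizes   # 1 2 1
sorted  # FALSE
```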
Creating a factor from a 'GRP' object using as_factor_GRP does not involve any computations, but may involve interacting multiple grouping columns using the paste function to produce unique factor levels.
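A brief sketch of this relationship, assuming collapse is installed: the factor's integer codes coincide with the group-id, and the levels are built from the unique groups.

```r
library(collapse)

g <- GRP(mtcars, ~ cyl + am)   # two grouping columns
f <- as_factor_GRP(g)

levels(f)                                   # cyl and am values combined using sep = "."
n_match  <- nlevels(f) == g$N.groups        # one level per group
id_match <- all(as.integer(f) == g$group.id) # codes equal the group-id
```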
Value
A list-like object of class 'GRP' containing information about the number of groups, the observations (rows) belonging to each group, the size of each group, the unique group names / definitions, whether the groups are ordered and whether the grouped data is sorted, the ordering vector used to perform the ordering and the group start positions. The object is structured as follows:
List-index | Element-name | Content type | Content description
[[1]] | N.groups | integer(1) | Number of groups
[[2]] | group.id | integer(NROW(X)) | An integer group-identifier
[[3]] | group.sizes | integer(N.groups) | Vector of group sizes
[[4]] | groups | unique(X) or NULL | Unique groups (same format as input, except for fgroup_by which uses a plain list, sorted if sort = TRUE), or NULL if return.groups = FALSE
[[5]] | group.vars | character | The names of the grouping variables
[[6]] | ordered | logical(2) | [1] Whether the groups are ordered: equal to the sort argument in the default method, or TRUE if converted objects inherit a class "ordered" and NA otherwise. [2] Whether the data (X) is already sorted: the result of !is.unsorted(group.id). If sort = FALSE (default method) the second entry is NA.
[[7]] | order | integer(NROW(X)) or NULL | Ordering vector from radixorderv (with "starts" attribute), or NULL if return.order = FALSE
[[8]] | group.starts | integer(N.groups) or NULL | The first-occurrence positions/rows of the groups. Useful e.g. with ffirst(x, g, na.rm = FALSE). NULL if return.groups = FALSE.
[[9]] | call | match.call() or NULL | The GRP() call, obtained from match.call(), or NULL if call = FALSE
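The elements can be accessed by name or by list index. A short sketch, assuming collapse is installed and the default sort = TRUE:

```r
library(collapse)

g <- GRP(mtcars$cyl)

names(g)                                 # element names as listed in the table
same_elem <- identical(g[[1]], g$N.groups)

g$group.sizes                            # one entry per group
g$ordered                                # [1] groups ordered, [2] data already sorted

# The ordering vector (7) sorts the grouped data
x_sorted <- mtcars$cyl[g$order]
```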
See Also
radixorder, group, qF, Fast Grouping and Ordering, Collapse Overview
Examples
## default method
GRP(mtcars$cyl)
GRP(mtcars, ~ cyl + vs + am) # Or GRP(mtcars, c("cyl","vs","am")) or GRP(mtcars, c(2,8:9))
g <- GRP(mtcars, ~ cyl + vs + am) # Saving the object
print(g) # Printing it
plot(g) # Plotting it
GRPnames(g) # Retain group names
GRPid(g) # Retain group id (same as g$group.id), useful inside fmutate()
fsum(mtcars, g) # Compute the sum of mtcars, grouped by variables cyl, vs and am
gsplit(mtcars$mpg, g) # Use the object to split a vector
gsplit(NULL, g) # The indices of the groups
identical(mtcars$mpg, # greorder and unlist undo the effect of gsplit
          greorder(unlist(gsplit(mtcars$mpg, g)), g))
## Convert factor to GRP object and vice-versa
GRP(iris$Species)
as_factor_GRP(g)
## dplyr integration
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> GRP() # Get GRP object from a dplyr grouped tibble
mtcars |> group_by(cyl,vs,am) |> fmean() # Grouped mean using dplyr grouping
mtcars |> fgroup_by(cyl,vs,am) |> fmean() # Faster alternative with collapse grouping
mtcars |> fgroup_by(cyl,vs,am) # Print method for grouped data frame
## Adding a column of group sizes.
mtcars |> fgroup_by(cyl,vs,am) |> fsummarise(Sizes = GRPN())
# Note: can also set_collapse(mask = "n") to use n() instead, see help("collapse-options")
# Other usage modes:
mtcars |> fgroup_by(cyl,vs,am) |> fmutate(Sizes = GRPN())
mtcars |> fmutate(Sizes = GRPN(list(cyl,vs,am))) # Same thing, slightly more efficient
## Various options for programming and interactive use
fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) |> head(3)
fgroup_by(GGDC10S, 1:3, 5) |> head(3)
fgroup_by(GGDC10S, c("Variable", "Country")) |> head(3)
fgroup_by(GGDC10S, is.character) |> head(3)
fgroup_by(GGDC10S, Country:Variable, Year) |> head(3)
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) |> head(3)
## Note that you can create a grouped data frame without materializing the unique grouping columns
fgroup_by(GGDC10S, Variable, Country, return.groups = FALSE) |> fmutate(across(AGR:SUM, fscale))
fgroup_by(GGDC10S, Variable, Country, return.groups = FALSE) |> fselect(AGR:SUM) |> fmean()
## Note also that setting sort = FALSE on unsorted data can be much faster if sorted groups are not required
library(microbenchmark)
microbenchmark(gby(GGDC10S, Variable, Country), gby(GGDC10S, Variable, Country, sort = FALSE))