R: Fast group IDs

group_id {timeplyr}

R Documentation

Fast group IDs

Description

These are tidy-based functions for calculating group IDs, row IDs and group orders.

group_id() returns an integer vector of group IDs the same size as the data.
row_id() returns an integer vector of row IDs.
group_order() returns the order of the groups.

The add_ variants add a column of group IDs/row IDs/group orders.

Usage

group_id(
  data,
  ...,
  order = TRUE,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  as_qg = FALSE
)

add_group_id(
  data,
  ...,
  order = TRUE,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  .name = NULL,
  as_qg = FALSE
)

row_id(data, ..., ascending = TRUE, .by = NULL, .cols = NULL)

## S3 method for class 'GRP'
row_id(data, ascending = TRUE, ...)

add_row_id(data, ..., ascending = TRUE, .by = NULL, .cols = NULL, .name = NULL)

group_order(data, ..., ascending = TRUE, .by = NULL, .cols = NULL)

add_group_order(
  data,
  ...,
  ascending = TRUE,
  .by = NULL,
  .cols = NULL,
  .name = NULL
)

Arguments

`data`	A data frame or vector.
`...`	Additional groups using tidy `data-masking` rules. To specify groups using `tidyselect`, simply use the `.by` argument.
`order`	Should the groups be ordered? Note that the physical order of the data is not changed. When `order` is `TRUE` (the default), group IDs are numbered according to the sorted order of the groups, but the data itself is not sorted. The expression `identical(order(x, na.last = TRUE), order(group_id(x, order = TRUE)))`, or for a data frame `identical(order(x1, x2, x3, na.last = TRUE), order(group_id(data, x1, x2, x3, order = TRUE)))`, should always hold. If `FALSE`, group IDs are numbered by order of first appearance.
`ascending`	Should the group order be ascending or descending? The default is `TRUE`. For `row_id()` this determines whether the row IDs are increasing or decreasing. Note: when `order = FALSE`, the `ascending` argument is ignored. This will be fixed in a later version.
`.by`	Alternative way of supplying groups using `tidyselect` notation.
`.cols`	(Optional) alternative to `...` that accepts a named character vector or numeric vector. If speed is a priority, using this argument is recommended.
`as_qg`	Should the group IDs be returned as a collapse "qG" class? The default (`FALSE`) always returns an integer vector.
`.name`	Name of the added ID column which should be a character vector of length 1. If `.name = NULL` (the default), `add_group_id()` will add a column named "group_id", and if one already exists, a unique name will be used.
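
The difference between `order = TRUE` and `order = FALSE` can be pictured with a small base R sketch. This is only an analogy using `match()`, `unique()` and `sort()` on a toy vector; it is not how timeplyr computes group IDs, and the variable names are made up for illustration.

```r
# Base R analogy for the two ID numberings (not timeplyr's implementation)
x <- c(3, 1, 2, 1, 3)

# IDs numbered by sorted group value, analogous to order = TRUE
ordered_ids <- match(x, sort(unique(x)))
ordered_ids
# 3 1 2 1 3

# IDs numbered by first appearance, analogous to order = FALSE
appearance_ids <- match(x, unique(x))
appearance_ids
# 1 2 3 2 1

# x itself is never rearranged; only the ID numbering differs.
# Ordering by the sorted-style IDs reproduces the ordering of x.
same_order <- identical(order(x), order(ordered_ids))
same_order
# TRUE
```

In both cases the IDs line up element by element with the original vector, which is what "ordered but not sorted" means here.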

Details

Note that for data frames these functions assume no groups by default. Grouping is applied only when you supply groups.

This means that when no groups are supplied:

group_id(iris) returns a vector of ones
row_id(iris) returns the plain row id numbers
group_order(iris) == row_id(iris).

One can specify groups in the second argument like so:

group_id(iris, Species)
row_id(iris, across(all_of("Species")))
group_order(iris, across(where(is.numeric), desc))
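
As a rough picture of what grouped row IDs and a group order represent, the following base R sketch computes analogous values for a toy grouping vector with `ave()` and `order()`. It is an illustration under the assumption that row IDs count upwards within each group; it does not use timeplyr, and `g`, `row_ids` and `grp_order` are hypothetical names.

```r
# Toy grouping vector: group 2 occupies rows 1, 3, 5; group 1 occupies rows 2, 4
g <- c(2L, 1L, 2L, 1L, 2L)

# Row IDs counted within each group, analogous to row_id(data, g)
row_ids <- ave(seq_along(g), g, FUN = seq_along)
row_ids
# 1 1 2 2 3

# Positions that would sort the data by group, analogous to group_order(data, g)
grp_order <- order(g)
grp_order
# 2 4 1 3 5
```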

If you want group_id to always group by all the columns of a data frame while still making use of the group_id methods, you can use the function below.

group_id2 <- function(data, ...){
 group_id(data, ..., .cols = names(data))
}

Value

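An integer vector, or a collapse "qG" object when as_qg = TRUE. The add_ variants return the data with the ID column added.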

Examples

library(timeplyr)
library(dplyr)
library(ggplot2)

group_id(iris) # No groups
group_id(iris, Species) # Species groups
row_id(iris) # Plain row IDs
row_id(iris, Species) # Row IDs by group
# Order of Species + descending Petal.Width
group_order(iris, Species, desc(Petal.Width))
# Same as
order(iris$Species, -xtfrm(iris$Petal.Width))

# Tidy data-masking/tidyselect can be used
group_id(iris, across(where(is.numeric))) # Groups across numeric values
# Alternatively using tidyselect
group_id(iris, .by = where(is.numeric))

# Group IDs using a mixed order
group_id(iris, desc(Species), Sepal.Length, desc(Petal.Width))

# add_ helpers
iris %>%
  distinct(Species) %>%
  add_group_id(Species)
iris %>%
  add_row_id(Species) %>%
  pull(row_id)

# Usage in data.table
library(data.table)
iris_dt <- as.data.table(iris)
iris_dt[, group_id := group_id(.SD, .cols = names(.SD)),
        .SDcols = "Species"]

# Or, if you use this often, you can write a wrapper that adds the column by reference
set_add_group_id <- function(x, ..., .name = "group_id"){
  id <- group_id(x, ...)
  data.table::set(x, j = .name, value = id)
}
set_add_group_id(iris_dt, desc(Species))[]

mm_mpg <- mpg %>%
  select(manufacturer, model) %>%
  arrange(desc(pick(everything())))

# Sorted/non-sorted groups
mm_mpg %>%
  add_group_id(across(everything()),
               .name = "sorted_id", order = TRUE) %>%
  add_group_id(manufacturer, model,
               .name = "not_sorted_id", order = FALSE) %>%
  distinct()

[Package timeplyr version 0.8.1 Index]