R: Faster 'dplyr::slice()'

fslice {timeplyr}

R Documentation

Faster `dplyr::slice()`

Description

Faster alternatives to dplyr::slice() and its variants. The fslice() functions are much faster than their dplyr counterparts when the data contains many groups.

Usage

fslice(data, ..., .by = NULL, keep_order = FALSE, sort_groups = TRUE)

fslice_head(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_tail(
  data,
  ...,
  n,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_min(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_max(
  data,
  order_by,
  ...,
  n,
  prop,
  .by = NULL,
  with_ties = TRUE,
  na_rm = FALSE,
  keep_order = FALSE,
  sort_groups = TRUE
)

fslice_sample(
  data,
  n,
  replace = FALSE,
  prop,
  .by = NULL,
  keep_order = FALSE,
  sort_groups = TRUE,
  weights = NULL,
  seed = NULL
)

Arguments

`data`	Data frame
`...`	See `?dplyr::slice` for details.
`.by`	(Optional). A selection of columns to group by for this operation. Columns are specified using tidy-select.
`keep_order`	Should the sliced data frame be returned in its original order? The default is `FALSE`.
`sort_groups`	If `TRUE` (the default) the by-group slices will be done in order of the sorted groups. If `FALSE` the group order is determined by first-appearance in the data.
`n`	Number of rows.
`prop`	Proportion of rows.
`order_by`	Variables to order by.
`with_ties`	Should ties be kept together? The default is `TRUE`.
`na_rm`	Should missing values in `fslice_max()` and `fslice_min()` be removed? The default is `FALSE`.
`replace`	Should `fslice_sample()` sample with or without replacement? Default is `FALSE`, without replacement.
`weights`	Probability weights used in `fslice_sample()`.
`seed`	Seed number defining the RNG state. If supplied, the seed is applied only locally within the function: the RNG state that was in place before the call is restored afterwards, so the surrounding random number stream is unaffected. If left `NULL` (the default), the seed is never modified.
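
The local-seed behaviour described for `seed` can be sketched in base R. This is only an illustration of the general save-and-restore technique, not timeplyr's actual implementation; the helper name `sample_with_local_seed()` is hypothetical.

```r
# Hypothetical helper (not part of timeplyr): draw `size` indices from 1..n,
# applying `seed` only for the duration of the call.
sample_with_local_seed <- function(n, size, seed = NULL) {
  if (is.null(seed)) {
    # No seed supplied: use the ambient RNG state as-is
    return(sample.int(n, size))
  }
  # Remember the current RNG state, if one exists yet
  has_seed <- exists(".Random.seed", envir = globalenv(), inherits = FALSE)
  if (has_seed) {
    old_seed <- get(".Random.seed", envir = globalenv(), inherits = FALSE)
  }
  # Restore (or remove) the RNG state when the function exits
  on.exit({
    if (has_seed) {
      assign(".Random.seed", old_seed, envir = globalenv())
    } else {
      rm(".Random.seed", envir = globalenv())
    }
  }, add = TRUE)
  set.seed(seed)
  sample.int(n, size)
}

set.seed(1)
sample_with_local_seed(10, 3, seed = 42)  # reproducible draw
runif(1)  # same value as calling runif(1) straight after set.seed(1)
```

Restoring the state in `on.exit()` means the caller's random number stream continues as if the seeded call had never happened, even if the function exits with an error.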

Details

fslice() and friends allow for more flexibility in how you order the by-group slicing.
Furthermore, you can control whether the returned data frame is sliced in the order of the supplied row indices, or whether the original order is retained (like dplyr::filter()).
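
A small sketch of the two ordering controls, using only the arguments listed under Usage. The data frame `df` and its columns are made up for illustration, and no output is shown here.

```r
library(timeplyr)

# Toy data: group "b" appears before group "a"
df <- data.frame(g = c("b", "a", "b", "a"), x = 1:4)

# Default: slices are taken in order of the sorted groups
by_sorted <- fslice(df, 1, .by = g)

# Groups are processed in order of first appearance in the data
by_appearance <- fslice(df, 1, .by = g, sort_groups = FALSE)

# Selected rows are returned in their original row order,
# much like dplyr::filter() would return them
original_order <- fslice(df, 1, .by = g, keep_order = TRUE)
```

All three calls select the same rows (the first row of each group); only the order of the rows in the result is affected.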

In fslice(), when length(n) == 1, an optimised method is implemented that internally uses list_subset(), a fast function for extracting single elements from single-level lists that contain vectors of the same type, e.g. integer.

fslice_head() and fslice_tail() are very fast with large numbers of groups.

fslice_sample() is arguably more intuitive than dplyr::slice_sample(): by default it samples every row of each group without replacement (in effect shuffling the rows within each group), so there is no need to specify a maximum group size as in dplyr::slice_sample().
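
The difference can be sketched on a toy data frame with unequal group sizes. The data frame `df` is made up for illustration, and the dplyr call assumes dplyr 1.1.0 or later for the `by` argument.

```r
library(timeplyr)
library(dplyr)

# Toy data: group "a" has 3 rows, group "b" has 5
df <- data.frame(g = rep(c("a", "b"), times = c(3, 5)), x = 1:8)

# timeplyr: no n needed, every row of each group is drawn once
shuffled <- fslice_sample(df, .by = g, seed = 42)

# dplyr: n must be at least the size of the largest group
# to return every row of every group
shuffled_dplyr <- slice_sample(df, n = 5, by = g)
```

Both results contain all eight rows, reordered at random within each group.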

Value

A data frame containing the specified rows.

Examples

library(timeplyr)
library(dplyr)
library(nycflights13)

flights <- flights %>%
  group_by(origin, dest)

# First row of each group, repeated twice
flights %>%
  fslice(1, 1)
# First row per group
flights %>%
  fslice_head(n = 1)
# Last row per group
flights %>%
  fslice_tail(n = 1)
# Earliest flight per group
flights %>%
  fslice_min(time_hour, with_ties = FALSE)
# Latest flight per group
flights %>%
  fslice_max(time_hour, with_ties = FALSE)
# Random sample without replacement by group
# (or stratified random sampling)
flights %>%
  fslice_sample()

[Package timeplyr version 0.8.1 Index]