R: Downsampling of rows in a data frame

downsample {groupdata2}

R Documentation

Downsampling of rows in a data frame

Description

Uses random downsampling to reduce every group to the size of the smallest group in the `data.frame`.

Wraps balance().

Usage

downsample(data, cat_col, id_col = NULL, id_method = "n_ids")

Arguments

`data`	`data.frame`. Can be grouped, in which case the function is applied group-wise.
`cat_col`	Name of categorical variable to balance by. (Character)
`id_col`	Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`. E.g. if we have measured a participant multiple times and want to make sure that we keep all these measurements, we would either remove/add all measurements for the participant or leave in all measurements for the participant. N.B. When `data` is a grouped `data.frame` (see `dplyr::group_by()`), IDs that appear in multiple groupings are considered separate entities within those groupings.
`id_method`	Method for balancing the IDs. (Character) One of `"n_ids"`, `"n_rows_c"`, `"distributed"`, or `"nested"`.

n_ids (default): Balances on ID level only. It makes sure there is the same number of IDs for each category. This might lead to a different number of rows between categories.

n_rows_c: Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps:
1. If a category needs to add all its rows one or more times, the data is repeated.
2. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled.

distributed: Distributes the lacking/excess rows equally between the IDs. If the number to distribute cannot be equally divided, some IDs will have 1 row more/less than the others.

nested: Calls `balance()` on each category with IDs as `cat_col`. I.e. if `size` is `"min"`, IDs will have the size of the smallest ID in their category.
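The arithmetic behind `"distributed"` can be illustrated with a small base R sketch. This is not the package's code: `distribute_excess()` is a hypothetical helper, and it assumes every ID has at least as many rows as its share, which the real function would have to handle.

```r
# Hypothetical helper (not part of groupdata2): decide how many rows
# each of `n_ids` IDs gives up when `n_excess` rows must be removed.
distribute_excess <- function(n_excess, n_ids) {
  base <- n_excess %/% n_ids      # every ID gives up at least this many rows
  remainder <- n_excess %% n_ids  # this many IDs give up one extra row
  extra <- rep(c(1L, 0L), c(remainder, n_ids - remainder))
  # Randomly choose which IDs carry the extra row
  base + extra[sample.int(n_ids)]
}

# 7 excess rows spread over 3 IDs: two IDs lose 2 rows, one loses 3
removals <- distribute_excess(7, 3)
sum(removals)
```

The shares differ by at most one row, matching the "1 row more/less" wording above.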

Details

Without `id_col`

Downsampling is done without replacement, meaning that rows are not duplicated but only removed.
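As a rough illustration of sampling without replacement, the base R sketch below keeps the same number of rows per category. `downsample_sketch()` is a hypothetical name and the approach is an assumption for illustration, not the implementation used by `downsample()` or `balance()`.

```r
# Hypothetical helper (not part of groupdata2): reduce every category
# to the size of the smallest one, sampling rows without replacement.
downsample_sketch <- function(data, cat_col) {
  # Row indices for each level of the categorical column
  idx_by_cat <- split(seq_len(nrow(data)), data[[cat_col]], drop = TRUE)
  # The target size is the size of the smallest category
  target <- min(lengths(idx_by_cat))
  # Pick `target` distinct row indices per category
  keep <- unlist(lapply(idx_by_cat, function(idx) {
    idx[sample.int(length(idx), target)]
  }), use.names = FALSE)
  # Rows are returned in their original order here; the sketch does not
  # reproduce the ordering by `cat_col` described under Value
  data[sort(keep), , drop = FALSE]
}

df <- data.frame(
  diagnosis = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  score = 1:13
)
balanced <- downsample_sketch(df, "diagnosis")
table(balanced$diagnosis)
```

Because each index is drawn at most once, no row appears twice in the result; rows are only removed.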

With `id_col`

See `id_method` description.
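For intuition, the default `"n_ids"` behaviour can be approximated in base R as below. `downsample_ids_sketch()` is a hypothetical helper, and it assumes each ID belongs to exactly one category; it is a sketch of the idea, not the package's code.

```r
# Hypothetical helper (not part of groupdata2): keep the same number of
# IDs in every category, retaining all rows of each kept ID.
downsample_ids_sketch <- function(data, cat_col, id_col) {
  # Unique (category, ID) pairs; assumes one category per ID
  pairs <- unique(data[, c(cat_col, id_col)])
  ids_by_cat <- split(as.character(pairs[[id_col]]), pairs[[cat_col]],
                      drop = TRUE)
  # The target is the number of IDs in the category with the fewest IDs
  target <- min(lengths(ids_by_cat))
  # Sample `target` IDs per category without replacement
  keep_ids <- unlist(lapply(ids_by_cat, function(ids) {
    ids[sample.int(length(ids), target)]
  }), use.names = FALSE)
  data[as.character(data[[id_col]]) %in% keep_ids, , drop = FALSE]
}

df <- data.frame(
  participant = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  diagnosis = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0))
)
res <- downsample_ids_sketch(df, "diagnosis", "participant")
# Number of distinct participants per diagnosis
ids_per_cat <- tapply(as.character(res$participant), res$diagnosis,
                      function(x) length(unique(x)))
ids_per_cat
```

Both diagnoses end up with two participants, but the row counts can still differ, as noted for `"n_ids"`.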

Value

data.frame with some rows removed. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Examples

# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using downsample()
downsample(df, cat_col = "diagnosis")

# Using downsample() with id_method "n_ids"
# Keeps the same number of participants per diagnosis
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids"
)

# Using downsample() with id_method "n_rows_c"
# Removes entire participants to level the rows per diagnosis
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c"
)

# Using downsample() with id_method "distributed"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed"
)

# Using downsample() with id_method "nested"
downsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested"
)

[Package groupdata2 version 2.0.3 Index]
