R: Upsampling of rows in a data frame

upsample {groupdata2}

R Documentation

Upsampling of rows in a data frame

Description

Uses random upsampling to fix the group sizes to the largest group in the data frame.

Wraps balance().

Usage

upsample(
  data,
  cat_col,
  id_col = NULL,
  id_method = "n_ids",
  mark_new_rows = FALSE,
  new_rows_col_name = ".new_row"
)

Arguments

`data`	`data.frame`. Can be grouped, in which case the function is applied group-wise.
`cat_col`	Name of categorical variable to balance by. (Character)
`id_col`	Name of factor with IDs. (Character) IDs are considered entities, e.g. allowing us to add or remove all rows for an ID. How this is used is up to the `id_method`. E.g. If we have measured a participant multiple times and want make sure that we keep all these measurements. Then we would either remove/add all measurements for the participant or leave in all measurements for the participant. N.B. When `data` is a grouped `data.frame` (see `dplyr::group_by()`), IDs that appear in multiple groupings are considered separate entities within those groupings.
`id_method`	Method for balancing the IDs. (Character) `"n_ids"`, `"n_rows_c"`, `"distributed"`, or `"nested"`. n_ids (default) Balances on ID level only. It makes sure there are the same number of IDs for each category. This might lead to a different number of rows between categories. n_rows_c Attempts to level the number of rows per category, while only removing/adding entire IDs. This is done in 2 steps: If a category needs to add all its rows one or more times, the data is repeated. Iteratively, the ID with the number of rows closest to the lacking/excessive number of rows is added/removed. This happens until adding/removing the closest ID would lead to a size further from the target size than the current size. If multiple IDs are closest, one is randomly sampled. distributed Distributes the lacking/excess rows equally between the IDs. If the number to distribute can not be equally divided, some IDs will have 1 row more/less than the others. nested Calls `balance()` on each category with IDs as cat_col. I.e. if size is `"min"`, IDs will have the size of the smallest ID in their category.
`mark_new_rows`	Add column with `1`s for added rows, and `0`s for original rows. (Logical)
`new_rows_col_name`	Name of column marking new rows. Defaults to `".new_row"`.

Details

Without `id_col`

Upsampling is done with replacement for added rows, while the original data remains intact.

With `id_col`

See `id_method` description.

Value

data.frame with added rows. Ordered by potential grouping variables, `cat_col` and (potentially) `id_col`.

Author(s)

Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk

Examples

# Attach packages
library(groupdata2)

# Create data frame
df <- data.frame(
  "participant" = factor(c(1, 1, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5)),
  "diagnosis" = factor(c(0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0)),
  "trial" = c(1, 2, 1, 1, 2, 3, 4, 1, 2, 1, 2, 3, 4),
  "score" = sample(c(1:100), 13)
)

# Using upsample()
upsample(df, cat_col = "diagnosis")

# Using upsample() with id_method "n_ids"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_ids",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "n_rows_c"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "n_rows_c",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "distributed"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "distributed",
  mark_new_rows = TRUE
)

# Using upsample() with id_method "nested"
# With column specifying added rows
upsample(df,
  cat_col = "diagnosis",
  id_col = "participant",
  id_method = "nested",
  mark_new_rows = TRUE
)

[Package groupdata2 version 2.0.3 Index]

Upsampling of rows in a data frame

Description

Usage

Arguments

n_ids (default)

n_rows_c

distributed