partition {groupdata2} | R Documentation |
Create balanced partitions
Description
Splits data into partitions. Balances a given categorical variable and/or numerical variable between partitions and keeps (if possible) all data points with a shared ID (e.g. participant_id) in the same partition.
Usage
partition(
data,
p = 0.2,
cat_col = NULL,
num_col = NULL,
id_col = NULL,
id_aggregation_fn = sum,
extreme_pairing_levels = 1,
force_equal = FALSE,
list_out = TRUE
)
Arguments
data |
|
p |
List or vector of partition sizes.
Given as whole number(s) and/or percentage(s) ( E.g. |
cat_col |
Name of categorical variable to balance between partitions. E.g. when training and testing a model for predicting a binary variable (a or b), we usually want both classes represented in both the training set and the test set. N.B. If also passing an |
num_col |
Name of numerical variable to balance between partitions. N.B. When used with |
id_col |
Name of factor with IDs. Used to keep all rows that share an ID in the same partition (if possible). E.g. If we have measured a participant multiple times and want to see the effect of time, we want to have all observations of this participant in the same partition. N.B. When |
id_aggregation_fn |
Function for aggregating values in N.B. Only used when |
extreme_pairing_levels |
How many levels of extreme pairing to do
when balancing partitions by a numerical column (i.e. Extreme pairing: Rows/pairs are ordered as smallest, largest,
second smallest, second largest, etc. If N.B. Larger values work best with large datasets. If set too high,
the result might not be stochastic. Always check if an increase
actually makes the partitions more balanced. See |
force_equal |
Whether to discard excess data. (Logical) |
list_out |
Whether to return partitions in a N.B. When |
Details
cat_col
- `data` is subset by `cat_col`.
- Subsets are partitioned and merged.
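The subset, partition and merge steps can be sketched in base R. This is a minimal illustration, not the package's internal code: the toy data, the `.partitions` column name used here, and rounding the partition size down with `floor()` are all assumptions made for the example.

```r
set.seed(1)

# Toy data (hypothetical): 6 rows of class "a" and 4 rows of class "b"
df <- data.frame(
  diagnosis = rep(c("a", "b"), times = c(6, 4)),
  score = 1:10
)
p <- 0.5

# 1. Subset the data by the categorical column
subsets <- split(df, df$diagnosis)

# 2. Partition each subset separately (assumption: sizes are rounded down)
partitioned <- lapply(subsets, function(sub) {
  n_first <- floor(nrow(sub) * p)
  # sample() returns a random permutation of 1..nrow(sub), so exactly
  # n_first rows end up in the first partition
  in_first <- sample(nrow(sub)) <= n_first
  sub$.partitions <- ifelse(in_first, 1, 2)
  sub
})

# 3. Merge the subsets again
merged <- do.call(rbind, partitioned)

# Each class is split in the same proportion
tab <- table(merged$diagnosis, merged$.partitions)
print(tab)
```

Because each class is partitioned on its own, both classes are represented in every partition in roughly the proportion given by `p`.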
id_col
Partitions are created from unique IDs.
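As a rough base R sketch of partitioning on unique IDs (not the package's internal code; the toy data and the `floor()` rounding are assumptions), the IDs are sampled rather than the rows, and every row then inherits the partition of its ID:

```r
set.seed(1)

# Toy data (hypothetical): 4 participants measured in 3 sessions each
df <- data.frame(
  participant = rep(c("p1", "p2", "p3", "p4"), each = 3),
  session = rep(1:3, times = 4)
)

# Partition the unique IDs instead of the rows
ids <- unique(df$participant)
n_first <- floor(length(ids) * 0.5)
first_ids <- sample(ids, n_first)

# Transfer the partition of each ID to all of its rows
df$.partitions <- ifelse(df$participant %in% first_ids, 1, 2)

# Every participant appears in exactly one partition
print(table(df$participant, df$.partitions))
```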
num_col
- Rows are shuffled. Note that this will only affect rows with the same value in `num_col`.
- Extreme pairing 1: Rows are ordered as smallest, largest, second smallest, second largest, etc. Each pair gets a group identifier.
- If `extreme_pairing_levels` > 1: The group identifiers are reordered as smallest, largest, second smallest, second largest, etc., by the sum of `num_col` in the represented rows. These pairs (of pairs) get a new set of group identifiers, and the process is repeated `extreme_pairing_levels` - 2 times. Note that the group identifiers at the last level will represent 2^`extreme_pairing_levels` rows, so be careful when choosing that setting.
- The final group identifiers are shuffled, and their order is applied to the full dataset.
- The ordered dataset is split by the sizes in `p`.
N.B. When doing extreme pairing of an odd number of rows, the row with the largest value is placed in a group by itself, and the order is instead: smallest, second largest, second smallest, third largest, ... , largest.
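The pairing logic described above can be illustrated with a small base R sketch. `extreme_pair_ids()` is a hypothetical helper written for this illustration, not a groupdata2 function, and it leaves out the shuffling steps.

```r
# Hypothetical helper (not part of groupdata2): assign a group identifier
# to each value so that the smallest is paired with the largest, the second
# smallest with the second largest, and so on.
extreme_pair_ids <- function(x) {
  n <- length(x)
  ord <- order(x)            # row indices from smallest to largest value
  n_pairs <- n %/% 2
  odd <- n %% 2 == 1
  # With an odd number of rows, the largest value is left out of the pairing
  n_pairable <- n - odd
  ids <- numeric(n)
  lo <- ord[seq_len(n_pairs)]                   # smallest, second smallest, ...
  hi <- ord[n_pairable + 1 - seq_len(n_pairs)]  # largest pairable, next largest, ...
  ids[lo] <- seq_len(n_pairs)
  ids[hi] <- seq_len(n_pairs)
  # The largest value forms a group by itself
  if (odd) ids[ord[n]] <- n_pairs + 1
  ids
}

# Level 1: pair the rows
scores <- c(12, 3, 45, 7, 30, 22, 9, 18)
level_1 <- extreme_pair_ids(scores)

# Level 2: pair the pairs, ordered by the sum of the values they represent
pair_sums <- as.numeric(tapply(scores, level_1, sum))
level_2 <- extreme_pair_ids(pair_sums)[level_1]

print(data.frame(scores, level_1, level_2))
print(tapply(scores, level_2, sum))
```

Each level-2 group here represents four rows, which is why higher levels only make sense for larger datasets.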
cat_col AND id_col
- `data` is subset by `cat_col`.
- Partitions are created from unique IDs in each subset.
- Subsets are merged.
cat_col AND num_col
- `data` is subset by `cat_col`.
- Subsets are partitioned by `num_col`.
- Subsets are merged.
num_col AND id_col
- Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.
- The IDs are partitioned, using the aggregated values as "num_col".
- The partition identifiers are transferred to the rows of the IDs.
cat_col AND num_col AND id_col
- Values in `num_col` are aggregated for each ID, using `id_aggregation_fn`.
- IDs are subset by `cat_col`.
- The IDs for each subset are partitioned, using the aggregated values as "num_col".
- The partition identifiers are transferred to the rows of the IDs.
Value
If `list_out` is TRUE:
A list of partitions where partitions are data.frames.
If `list_out` is FALSE:
A data.frame with grouping factor for subsetting.
N.B. When `data` is a grouped data.frame, the output is always a data.frame with a grouping factor.
Author(s)
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
See Also
Other grouping functions: all_groups_identical(), collapse_groups_by, collapse_groups(), fold(), group_factor(), group(), splt()
Examples
# Attach packages
library(groupdata2)
library(dplyr)
# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)
# Using partition()
# Without balancing
partitions <- partition(data = df, p = c(0.2, 0.3))
# With cat_col
partitions <- partition(data = df, p = 0.5, cat_col = "diagnosis")
# With id_col
partitions <- partition(data = df, p = 0.5, id_col = "participant")
# With num_col
partitions <- partition(data = df, p = 0.5, num_col = "score")
# With cat_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  id_col = "participant"
)
# With cat_col, num_col and id_col
partitions <- partition(
  data = df,
  p = 0.5,
  cat_col = "diagnosis",
  num_col = "score",
  id_col = "participant"
)
# Return data frame with grouping factor
# with list_out = FALSE
partitions <- partition(df, c(0.5), list_out = FALSE)
# Check if additional extreme_pairing_levels
# improve the numerical balance
set.seed(2) # try with seed 1 as well
partitions_1 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 1,
  list_out = FALSE
)
partitions_1 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )
set.seed(2) # try with seed 1 as well
partitions_2 <- partition(
  data = df,
  p = 0.5,
  num_col = "score",
  extreme_pairing_levels = 2,
  list_out = FALSE
)
partitions_2 %>%
  dplyr::group_by(.partitions) %>%
  dplyr::summarise(
    sum_score = sum(score),
    mean_score = mean(score)
  )