summarize_balances {groupdata2} | R Documentation
Summarize group balances
Description
Summarize the balances of numeric, categorical, and ID columns in and between groups in one or more group columns.
This tool allows you to quickly and thoroughly assess the balance of different columns between groups. This is for instance useful after creating groups with fold(), partition(), or collapse_groups() to check how well they did and to compare multiple groupings.
The output contains:
- `Groups`: a summary per group (per grouping column).
- `Summary`: statistical descriptors of the group summaries.
- `Normalized Summary`: statistical descriptors of a set of "normalized" group summaries. (Disabled by default)
When comparing how balanced the grouping columns are, we can use
the standard deviations of the group summary columns. The lower a standard
deviation is, the more similar the groups are in that column. To quickly
extract these standard deviations, ordered by an aggregated rank,
use ranked_balances() on the "Summary" data.frame in the output.
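The idea of using standard deviations to compare groupings can be illustrated with a minimal base R sketch. It does not call the package; the data and the two grouping vectors (`grouping_a`, `grouping_b`) are made up for illustration.

```r
# Toy data: six scores and two hypothetical ways of splitting them into two groups
scores <- c(10, 12, 11, 30, 32, 31)
grouping_a <- c(1, 2, 1, 2, 1, 2) # mixes low and high scores
grouping_b <- c(1, 1, 1, 2, 2, 2) # separates low from high scores

# Per-group means (the kind of value found in a group summary)
means_a <- tapply(scores, grouping_a, mean)
means_b <- tapply(scores, grouping_b, mean)

# Standard deviation of the group means, per grouping
sd_a <- sd(means_a)
sd_b <- sd(means_b)

# The grouping with the lower standard deviation has more similar groups
print(c(grouping_a = sd_a, grouping_b = sd_b))
```

Here `grouping_a` gets the lower standard deviation, so its groups are more similar in `scores`.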
Usage
summarize_balances(
  data,
  group_cols,
  cat_cols = NULL,
  num_cols = NULL,
  id_cols = NULL,
  summarize_size = TRUE,
  include_normalized = FALSE,
  rank_weights = NULL,
  cat_levels_rank_weights = NULL,
  num_normalize_fn = function(x) {
    rearrr::min_max_scale(x,
      old_min = quantile(x, 0.025), old_max = quantile(x, 0.975),
      new_min = 0, new_max = 1
    )
  }
)
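To make the default `num_normalize_fn` more concrete, here is a rough base R sketch of the same arithmetic. It assumes that min-max scaling is the usual linear rescaling from an old range to [0, 1]. The helper name `quantile_min_max` is hypothetical and not part of groupdata2 or rearrr.

```r
# Hypothetical helper: linear rescaling where the "old" range is
# given by the 2.5% and 97.5% quantiles of x (new range is [0, 1])
quantile_min_max <- function(x) {
  old_min <- unname(quantile(x, 0.025))
  old_max <- unname(quantile(x, 0.975))
  (x - old_min) / (old_max - old_min)
}

x <- c(1:99, 1000) # toy data with one extreme value
scaled <- quantile_min_max(x)

# Most values land inside [0, 1]; the extreme value ends up above 1
in_range <- scaled >= 0 & scaled <= 1
print(mean(in_range))
print(max(scaled))
```

Using quantiles instead of the actual minimum and maximum means a single extreme value does not compress all other values into a narrow band.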
Arguments
data |
Can be grouped (see |
group_cols |
Names of columns with group identifiers to summarize columns
in |
cat_cols |
Names of categorical columns to summarize. Each categorical level is counted per group. To distinguish between levels with the same name from different
Normalization when |
num_cols |
Names of numerical columns to summarize. For each column, the Normalization when |
id_cols |
Names of The number of unique IDs are counted per group. Normalization when |
summarize_size |
Whether to summarize the number of rows per group. |
include_normalized |
Whether to calculate and include the normalized summary in the output. |
rank_weights |
A named When summarizing size (see E.g. |
cat_levels_rank_weights |
Weights for averaging ranks of the categorical levels in E.g. |
num_normalize_fn |
Function for normalizing the Only used when |
Value
list with two/three data.frames:
Groups
A summary per group.
`cat_cols`: Each level has its own column with the count of the level per group.
`num_cols`: The mean and sum per group.
`id_cols`: The count of unique IDs per group.
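The per-group quantities listed above can be sketched in base R as follows. This is only an illustration of what is counted and averaged, not the package's implementation; the data frame `d` and its column names are made up.

```r
# Toy data with a hypothetical grouping column named "fold"
d <- data.frame(
  fold = c(1, 1, 1, 2, 2, 2),
  diagnosis = c("a", "b", "a", "b", "b", "a"),
  score = c(10, 20, 30, 40, 50, 60),
  participant = c("p1", "p1", "p2", "p3", "p4", "p4")
)

# cat_cols: count of each level per group
level_counts <- table(d$fold, d$diagnosis)

# num_cols: mean and sum per group
score_mean <- tapply(d$score, d$fold, mean)
score_sum <- tapply(d$score, d$fold, sum)

# id_cols: number of unique IDs per group
n_ids <- tapply(d$participant, d$fold, function(ids) length(unique(ids)))

# Size: number of rows per group
n_rows <- table(d$fold)

print(level_counts)
print(rbind(score_mean, score_sum, n_ids, n_rows))
```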
Summary
Statistical descriptors of the columns in `Groups`.
Contains the mean, median, standard deviation (SD), interquartile range (IQR), min, and max measures.
Especially the standard deviations and IQR measures can tell us about how balanced the groups are. When comparing multiple `group_cols`, the group column with the lowest SD and IQR can be considered the most balanced.
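As a small illustration of these descriptors, the sketch below computes them in base R for one hypothetical group summary column (made-up mean scores of three folds). It is not the package's internal code.

```r
# Hypothetical per-group means of a numeric column
group_means <- c(fold_1 = 20, fold_2 = 50, fold_3 = 35)

# The six descriptors reported for a group summary column
descriptors <- c(
  mean = mean(group_means),
  median = median(group_means),
  SD = sd(group_means),
  IQR = IQR(group_means),
  min = min(group_means),
  max = max(group_means)
)

print(descriptors)
```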
Normalized Summary
(Disabled by default)
Same statistical descriptors as in `Summary` but for a "normalized" version of the group summaries. The motivation is that these normalized measures can more easily be compared or combined to a single "balance score".
First, we normalize each balance column:
`cat_cols`: The level counts in the original group summaries are normalized with log(1 + count). This eases comparison of the statistical descriptors (especially standard deviations) of levels with very different count scales.
`num_cols`: The numeric columns are normalized prior to summarization by group, using the `num_normalize_fn` function. By default this applies MinMax scaling to columns such that ~95% of the values are expected to be in the [0, 1] range.
`id_cols`: The counts of unique IDs in the original group summaries are normalized with log(1 + count).
Contains the mean, median, standard deviation (SD), interquartile range (IQR), min, and max measures.
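The effect of the log(1 + count) normalization on levels with very different count scales can be seen in a minimal base R sketch. The counts are made up for illustration.

```r
# Hypothetical per-group counts for a rare and a common level;
# the relative imbalance between groups is the same for both
rare_counts <- c(2, 4, 3)
common_counts <- c(200, 400, 300)

# Raw standard deviations differ by a factor of 100
sd_raw_rare <- sd(rare_counts)
sd_raw_common <- sd(common_counts)

# After log(1 + count), the standard deviations are on a similar scale
sd_log_rare <- sd(log(1 + rare_counts))
sd_log_common <- sd(log(1 + common_counts))

print(c(raw_ratio = sd_raw_common / sd_raw_rare,
        log_ratio = sd_log_common / sd_log_rare))
```

With raw counts, the common level would dominate any combined score; after the transformation, both levels contribute on a comparable scale.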
Author(s)
Ludvig Renbo Olsen, r-pkgs@ludvigolsen.dk
See Also
Other summarization functions:
ranked_balances()
,
summarize_group_cols()
Examples
# Attach packages
library(groupdata2)
library(dplyr)
set.seed(1)
# Create data frame
df <- data.frame(
  "participant" = factor(rep(c("1", "2", "3", "4", "5", "6"), 3)),
  "age" = rep(sample(c(1:100), 6), 3),
  "diagnosis" = factor(rep(c("a", "b", "a", "a", "b", "b"), 3)),
  "score" = sample(c(1:100), 3 * 6)
)
df <- df %>% arrange(participant)
df$session <- rep(c("1", "2", "3"), 6)
# Using fold()
## Without balancing
set.seed(1)
df_folded <- fold(data = df, k = 3)
# Check the balances of the various columns
# As we have not used balancing in `fold()`
# we should not expect it to be amazingly balanced
df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = ".folds",
    num_cols = c("score", "age"),
    cat_cols = "diagnosis",
    id_cols = "participant"
  )
## With balancing
set.seed(1)
df_folded <- fold(
  data = df,
  k = 3,
  cat_col = "diagnosis",
  num_col = "score",
  id_col = "participant"
)
# Now the balance should be better,
# although it may be difficult to get a good balance in
# the 'score' column when also balancing on 'diagnosis'
# and keeping all rows per participant in the same fold
df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = ".folds",
    num_cols = c("score", "age"),
    cat_cols = "diagnosis",
    id_cols = "participant"
  )
# Comparing multiple grouping columns
# Create 3 fold columns that only balance "score"
set.seed(1)
df_folded <- fold(
  data = df,
  k = 3,
  num_fold_cols = 3,
  num_col = "score"
)
# Summarize all three grouping cols at once
(summ <- df_folded %>%
  dplyr::ungroup() %>%
  summarize_balances(
    group_cols = paste0(".folds_", 1:3),
    num_cols = c("score")
  )
)
# Extract the across-group standard deviations
# The group column with the lowest standard deviation(s)
# is the most balanced group column
summ %>% ranked_balances()