R: Analyze variables

analyze_variables {tern}

R Documentation

Analyze variables

Description

The analyze function analyze_vars() generates a summary of one or more variables, using the S3 generic function s_summary() to calculate a list of summary statistics. A list of all available statistics for numeric variables can be viewed by running get_stats("analyze_vars_numeric") and for non-numeric variables by running get_stats("analyze_vars_counts"). Use the .stats parameter to specify the statistics to include in your output summary table.
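As a quick way to see which names can be passed to .stats, the two get_stats() calls mentioned above can be run directly. This is a minimal sketch that assumes the tern package is installed; the guard and the variable names numeric_stats and count_stats are illustrative and not part of the package.

```r
# Guarded so the snippet also runs (and does nothing) where tern is absent.
if (requireNamespace("tern", quietly = TRUE)) {
  # Statistics available for numeric variables.
  numeric_stats <- tern::get_stats("analyze_vars_numeric")
  # Statistics available for non-numeric variables.
  count_stats <- tern::get_stats("analyze_vars_counts")

  print(numeric_stats)
  print(count_stats)
}
```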

Usage

analyze_vars(
  lyt,
  vars,
  var_labels = vars,
  na_str = default_na_str(),
  nested = TRUE,
  ...,
  na.rm = TRUE,
  show_labels = "default",
  table_names = vars,
  section_div = NA_character_,
  .stats = c("n", "mean_sd", "median", "range", "count_fraction"),
  .formats = NULL,
  .labels = NULL,
  .indent_mods = NULL
)

s_summary(x, na.rm = TRUE, denom, .N_row, .N_col, .var, ...)

## S3 method for class 'numeric'
s_summary(
  x,
  na.rm = TRUE,
  denom,
  .N_row,
  .N_col,
  .var,
  control = control_analyze_vars(),
  ...
)

## S3 method for class 'factor'
s_summary(
  x,
  na.rm = TRUE,
  denom = c("n", "N_row", "N_col"),
  .N_row,
  .N_col,
  ...
)

## S3 method for class 'character'
s_summary(
  x,
  na.rm = TRUE,
  denom = c("n", "N_row", "N_col"),
  .N_row,
  .N_col,
  .var,
  verbose = TRUE,
  ...
)

## S3 method for class 'logical'
s_summary(
  x,
  na.rm = TRUE,
  denom = c("n", "N_row", "N_col"),
  .N_row,
  .N_col,
  ...
)

a_summary(
  x,
  .N_col,
  .N_row,
  .var = NULL,
  .df_row = NULL,
  .ref_group = NULL,
  .in_ref_col = FALSE,
  compare = FALSE,
  .stats = NULL,
  .formats = NULL,
  .labels = NULL,
  .indent_mods = NULL,
  na.rm = TRUE,
  na_str = default_na_str(),
  ...
)

Arguments

`lyt`	(`PreDataTableLayouts`) layout that analyses will be added to.
`vars`	(`character`) variable names for the primary analysis variable to be iterated over.
`var_labels`	(`character`) variable labels.
`na_str`	(`string`) string used to replace all `NA` or empty values in the output.
`nested`	(`flag`) whether this layout instruction should be applied within the existing layout structure if possible (`TRUE`, the default) or as a new top-level element (`FALSE`). Ignored if it would nest a split underneath analyses, which is not allowed.
`...`	arguments passed to `s_summary()`.
`na.rm`	(`flag`) whether `NA` values should be removed from `x` prior to analysis.
`show_labels`	(`string`) label visibility: one of "default", "visible" and "hidden".
`table_names`	(`character`) this can be customized in the case that the same `vars` are analyzed multiple times, to avoid warnings from `rtables`.
`section_div`	(`string`) string which should be repeated as a section divider after each group defined by this split instruction, or `NA_character_` (the default) for no section divider.
`.stats`	(`character`) statistics to select for the table. Run `get_stats("analyze_vars_numeric")` to see statistics available for numeric variables, and `get_stats("analyze_vars_counts")` for statistics available for non-numeric variables.
`.formats`	(named `character` or `list`) formats for the statistics. See Details in `analyze_vars` for more information on the `"auto"` setting.
`.labels`	(named `character`) labels for the statistics (without indent).
`.indent_mods`	(named `integer`) indent modifiers for the labels. Each element of the vector should be a name-value pair with name corresponding to a statistic specified in `.stats` and value the indentation for that statistic's row label.
`x`	(`numeric`) vector of numbers we want to analyze.
`denom`	(`string`) choice of denominator for proportion. Options are: `n`: number of values in this row and column intersection. `N_row`: total number of values in this row across columns. `N_col`: total number of values in this column across rows.
`.N_row`	(`integer(1)`) row-wise N (row group count) for the group of observations being analyzed (i.e. with no column-based subsetting) that is typically passed by `rtables`.
`.N_col`	(`integer(1)`) column-wise N (column count) for the full column being analyzed that is typically passed by `rtables`.
`.var`	(`string`) single variable name that is passed by `rtables` when requested by a statistics function.
`control`	(`list`) parameters for descriptive statistics details, specified by using the helper function `control_analyze_vars()`. Some possible parameter options are: `conf_level` (`proportion`) confidence level of the interval for mean and median. `quantiles` (`numeric(2)`) vector of length two to specify the quantiles. `quantile_type` (`numeric(1)`) between 1 and 9 selecting quantile algorithms to be used. See more about `type` in `stats::quantile()`. `test_mean` (`numeric(1)`) value to test against the mean under the null hypothesis when calculating p-value.
`verbose`	(`flag`) defaults to `TRUE`, which prints out warnings and messages. It is mainly used to print out information about factor casting.
`.df_row`	(`data.frame`) data frame across all of the columns for the given row split.
`.ref_group`	(`data.frame` or `vector`) the data corresponding to the reference group.
`.in_ref_col`	(`flag`) `TRUE` when working with the reference level, `FALSE` otherwise.
`compare`	(`flag`) whether comparison statistics should be analyzed instead of summary statistics (`compare = TRUE` adds `pval` statistic comparing against reference group).

Details

Automatic digit formatting: The number of digits to display can be automatically determined from the analyzed variable(s) (vars) for certain statistics by setting the statistic format to "auto" in .formats. This utilizes the format_auto() formatting function. Note that only data for the current row & variable (for all columns) will be considered (.df_row[[.var]], see rtables::additional_fun_params) and not the whole dataset.

Value

analyze_vars() returns a layout object suitable for passing to further layouting functions, or to rtables::build_table(). Adding this function to an rtable layout will add formatted rows containing the statistics from s_summary() to the table layout.

s_summary() returns different statistics depending on the class of x.

If x is of class numeric, returns a list with the following named numeric items:
- n: The length() of x.
- sum: The sum() of x.
- mean: The mean() of x.
- sd: The stats::sd() of x.
- se: The standard error of x mean, i.e.: (sd(x) / sqrt(length(x))).
- mean_sd: The mean() and stats::sd() of x.
- mean_se: The mean() of x and its standard error (see above).
- mean_ci: The CI for the mean of x (from stat_mean_ci()).
- mean_sei: The SE interval for the mean of x, i.e.: (mean() -/+ stats::sd() / sqrt(length(x))).
- mean_sdi: The SD interval for the mean of x, i.e.: (mean() -/+ stats::sd()).
- mean_pval: The two-sided p-value of the mean of x (from stat_mean_pval()).
- median: The stats::median() of x.
- mad: The median absolute deviation of x, i.e.: (stats::median() of xc, where xc = x - stats::median()).
- median_ci: The CI for the median of x (from stat_median_ci()).
- quantiles: Two sample quantiles of x (from stats::quantile()).
- iqr: The stats::IQR() of x.
- range: The range_noinf() of x.
- min: The min() of x.
- max: The max() of x.
- median_range: The median() and range_noinf() of x.
- cv: The coefficient of variation of x, i.e.: (stats::sd() / mean() * 100).
- geom_mean: The geometric mean of x, i.e.: (exp(mean(log(x)))).
- geom_cv: The geometric coefficient of variation of x, i.e.: (sqrt(exp(sd(log(x)) ^ 2) - 1) * 100).
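
Several of the formulas in the list above can be reproduced in base R. The following is a sketch for illustration only, not the s_summary() implementation; the vector x and the result name numeric_sketch are made up for the example.

```r
# Base-R sketch of a few numeric statistics, following the listed formulas.
x <- c(2, 4, 4, 5, 10)

n <- length(x)
m <- mean(x)
s <- stats::sd(x)

se <- s / sqrt(n)                 # standard error of the mean
mean_sei <- m + c(-1, 1) * se     # mean -/+ standard error
mean_sdi <- m + c(-1, 1) * s      # mean -/+ standard deviation
cv <- s / m * 100                 # coefficient of variation, in percent
geom_mean <- exp(mean(log(x)))    # only meaningful for strictly positive x
geom_cv <- sqrt(exp(stats::sd(log(x))^2) - 1) * 100

numeric_sketch <- list(
  n = n, mean = m, sd = s, se = se,
  mean_sei = mean_sei, mean_sdi = mean_sdi,
  cv = cv, geom_mean = geom_mean, geom_cv = geom_cv
)
print(numeric_sketch)
```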

If x is of class factor or converted from character, returns a list with named numeric items:
- n: The length() of x.
- count: A list with the number of cases for each level of the factor x.
- count_fraction: Similar to count but also includes the proportion of cases for each level of the factor x relative to the denominator, or NA if the denominator is zero.
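
The relationship between count, count_fraction and the denom choice can be sketched in base R. The helper count_fraction_sketch() below is hypothetical and only mirrors the description above (per-level counts, proportion relative to the chosen denominator, NA when the denominator is zero); it is not the package's code.

```r
# Hypothetical helper: per-level counts and fractions for a factor.
count_fraction_sketch <- function(x,
                                  denom = c("n", "N_row", "N_col"),
                                  .N_row = NULL,
                                  .N_col = NULL) {
  denom <- match.arg(denom)
  # One count per factor level, including levels with no observations.
  counts <- vapply(levels(x), function(l) sum(x == l, na.rm = TRUE), numeric(1))
  # Pick the denominator according to `denom`.
  dn <- switch(denom, n = length(x), N_row = .N_row, N_col = .N_col)
  lapply(counts, function(cnt) {
    c(count = cnt, fraction = if (dn > 0) cnt / dn else NA_real_)
  })
}

x <- factor(c("a", "a", "b", "c", "a"))
res_n <- count_fraction_sketch(x)
res_row <- count_fraction_sketch(x, denom = "N_row", .N_row = 10L)
res_empty <- count_fraction_sketch(factor(levels = c("a", "b")))
print(res_n)
print(res_row)
```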

If x is of class logical, returns a list with named numeric items:
- n: The length() of x (possibly after removing NAs).
- count: Count of TRUE in x.
- count_fraction: Count and proportion of TRUE in x relative to the denominator, or NA if the denominator is zero. Note that NAs in x are never counted and never lead to an NA result here.
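
The treatment of NA in the logical case can be illustrated with a small base-R sketch. The helper logical_count_sketch() is hypothetical, uses only the default denominator (the number of values in x), and is written from the description above rather than from the package source.

```r
# Hypothetical helper: count and fraction of TRUE in a logical vector.
logical_count_sketch <- function(x, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]
  n <- length(x)
  # NA values are never counted as TRUE, whatever na.rm is.
  count <- sum(x, na.rm = TRUE)
  fraction <- if (n > 0) count / n else NA_real_
  list(n = n, count = count, count_fraction = c(count = count, fraction = fraction))
}

x <- c(NA, TRUE, FALSE)
res_rm <- logical_count_sketch(x, na.rm = TRUE)
res_keep <- logical_count_sketch(x, na.rm = FALSE)
print(res_rm)
print(res_keep)
```

With na.rm = FALSE the NA still contributes to n, so the fraction shrinks while the count stays the same.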

a_summary() returns the corresponding list with formatted rtables::CellValue().

Functions

analyze_vars(): Layout-creating function which can take statistics function arguments and additional format arguments. This function is a wrapper for rtables::analyze().
s_summary(): S3 generic function to produce a variable summary.
s_summary(numeric): Method for numeric class.
s_summary(factor): Method for factor class.
s_summary(character): Method for character class. This makes an automatic conversion to factor (with a warning) and then forwards to the method for factors.
s_summary(logical): Method for logical class.
a_summary(): Formatted analysis function which is used as afun in analyze_vars() and compare_vars() and as cfun in summarize_colvars().

Note

If x is an empty vector, NA is returned. This is intended, so that rcell content can still be returned in rtables when the intersection of a column and a row selects no data.
When the mean function is applied to an empty vector, NA will be returned instead of NaN, the latter being standard behavior in R.
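
The difference from base R can be seen in a short sketch. The wrapper mean_or_na() is a hypothetical name used only to illustrate the behavior described in this note.

```r
# Base R returns NaN for the mean of an empty numeric vector.
base_result <- mean(numeric())
print(base_result)

# Hypothetical wrapper that returns NA instead, as described above.
mean_or_na <- function(x) {
  if (length(x) == 0) NA_real_ else mean(x)
}
print(mean_or_na(numeric()))
print(mean_or_na(c(1, 3)))
```

Because is.na() is also TRUE for NaN, is.nan() is the function to use when telling the two apart.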

If x is an empty factor, a list is still returned for counts with one element per factor level. If there are no levels in x, the function fails.
If factor variables contain NA, these NA values are excluded by default. To include NA values set na.rm = FALSE and missing values will be displayed as an NA level. Alternatively, an explicit factor level can be defined for NA values during pre-processing via df_explicit_na() - the default na_level ("<Missing>") will also be excluded when na.rm is set to TRUE.
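
What an explicit missing level looks like can be sketched in base R. This only approximates the effect of df_explicit_na() on a single factor; the data and the name explicit are made up for the example, and the level label is the default na_level mentioned above.

```r
# A factor with one missing value.
x <- factor(c(NA, "Female", "Male", "Female"))

# Add an explicit level and assign it to the missing entries.
explicit <- x
levels(explicit) <- c(levels(explicit), "<Missing>")
explicit[is.na(explicit)] <- "<Missing>"

# NA is silently dropped from the first table but counted in the second.
print(table(x))
print(table(explicit))
```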

Automatic conversion of character to factor does not guarantee that the table can be generated correctly. For sparse tables in particular, it is very likely to fail. It is therefore better to always pre-process the dataset so that factors are created manually from character variables before the dataset is passed to rtables::build_table().
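
One way to do this pre-processing in base R is sketched below. The data frame and its columns are invented for illustration, and the choice of levels for ARM is an assumption about what the full study would contain.

```r
# Illustrative data with character columns.
df <- data.frame(
  ARM = c("A", "B", "A"),
  SEX = c("F", "M", "F"),
  AVAL = c(1.2, 3.4, 5.6),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor; other columns are untouched.
is_chr <- vapply(df, is.character, logical(1))
df[is_chr] <- lapply(df[is_chr], factor)

# Set levels explicitly where a level may be absent from the data,
# so that sparse subsets still carry the full set of levels.
df$ARM <- factor(df$ARM, levels = c("A", "B", "C"))

str(df)
```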

To use for comparison (with additional p-value statistic), parameter compare must be set to TRUE.
Ensure that either all NA values are converted to an explicit NA level or all NA values are left as is.

Examples

## Fabricated dataset.
dta_test <- data.frame(
  USUBJID = rep(1:6, each = 3),
  PARAMCD = rep("lab", 6 * 3),
  AVISIT  = rep(paste0("V", 1:3), 6),
  ARM     = rep(LETTERS[1:3], rep(6, 3)),
  AVAL    = c(9:1, rep(NA, 9))
)

# `analyze_vars()` in `rtables` pipelines
## Default output within a `rtables` pipeline.
l <- basic_table() %>%
  split_cols_by(var = "ARM") %>%
  split_rows_by(var = "AVISIT") %>%
  analyze_vars(vars = "AVAL")

build_table(l, df = dta_test)

## Select and format statistics output.
l <- basic_table() %>%
  split_cols_by(var = "ARM") %>%
  split_rows_by(var = "AVISIT") %>%
  analyze_vars(
    vars = "AVAL",
    .stats = c("n", "mean_sd", "quantiles"),
    .formats = c("mean_sd" = "xx.x, xx.x"),
    .labels = c(n = "n", mean_sd = "Mean, SD", quantiles = c("Q1 - Q3"))
  )

build_table(l, df = dta_test)

## Use arguments interpreted by `s_summary`.
l <- basic_table() %>%
  split_cols_by(var = "ARM") %>%
  split_rows_by(var = "AVISIT") %>%
  analyze_vars(vars = "AVAL", na.rm = FALSE)

build_table(l, df = dta_test)

## Handle `NA` levels first when summarizing factors.
dta_test$AVISIT <- NA_character_
dta_test <- df_explicit_na(dta_test)
l <- basic_table() %>%
  split_cols_by(var = "ARM") %>%
  analyze_vars(vars = "AVISIT", na.rm = FALSE)

build_table(l, df = dta_test)

# auto format
dt <- data.frame("VAR" = c(0.001, 0.2, 0.0011000, 3, 4))
basic_table() %>%
  analyze_vars(
    vars = "VAR",
    .stats = c("n", "mean", "mean_sd", "range"),
    .formats = c("mean_sd" = "auto", "range" = "auto")
  ) %>%
  build_table(dt)

# `s_summary.numeric`

## Basic usage: empty numeric returns NA-filled items.
s_summary(numeric())

## Management of NA values.
x <- c(NA_real_, 1)
s_summary(x, na.rm = TRUE)
s_summary(x, na.rm = FALSE)

x <- c(NA_real_, 1, 2)
s_summary(x, stats = NULL)

## Benefits in `rtables` constructions:
dta_test <- data.frame(
  Group = rep(LETTERS[1:3], each = 2),
  sub_group = rep(letters[1:2], each = 3),
  x = 1:6
)

## The summary obtained with `rtables`:
basic_table() %>%
  split_cols_by(var = "Group") %>%
  split_rows_by(var = "sub_group") %>%
  analyze(vars = "x", afun = s_summary) %>%
  build_table(df = dta_test)

## By comparison with `lapply`:
X <- split(dta_test, f = with(dta_test, interaction(Group, sub_group)))
lapply(X, function(x) s_summary(x$x))

# `s_summary.factor`

## Basic usage:
s_summary(factor(c("a", "a", "b", "c", "a")))

# Empty factor returns zero-filled items.
s_summary(factor(levels = c("a", "b", "c")))

## Management of NA values.
x <- factor(c(NA, "Female"))
x <- explicit_na(x)
s_summary(x, na.rm = TRUE)
s_summary(x, na.rm = FALSE)

## Different denominators.
x <- factor(c("a", "a", "b", "c", "a"))
s_summary(x, denom = "N_row", .N_row = 10L)
s_summary(x, denom = "N_col", .N_col = 20L)

# `s_summary.character`

## Basic usage:
s_summary(c("a", "a", "b", "c", "a"), .var = "x", verbose = FALSE)
s_summary(c("a", "a", "b", "c", "a", ""), .var = "x", na.rm = FALSE, verbose = FALSE)

# `s_summary.logical`

## Basic usage:
s_summary(c(TRUE, FALSE, TRUE, TRUE))

# Empty logical vector returns zero-filled items.
s_summary(as.logical(c()))

## Management of NA values.
x <- c(NA, TRUE, FALSE)
s_summary(x, na.rm = TRUE)
s_summary(x, na.rm = FALSE)

## Different denominators.
x <- c(TRUE, FALSE, TRUE, TRUE)
s_summary(x, denom = "N_row", .N_row = 10L)
s_summary(x, denom = "N_col", .N_col = 20L)

a_summary(factor(c("a", "a", "b", "c", "a")), .N_row = 10, .N_col = 10)
a_summary(
  factor(c("a", "a", "b", "c", "a")),
  .ref_group = factor(c("a", "a", "b", "c")), compare = TRUE
)

a_summary(c("A", "B", "A", "C"), .var = "x", .N_col = 10, .N_row = 10, verbose = FALSE)
a_summary(
  c("A", "B", "A", "C"),
  .ref_group = c("B", "A", "C"), .var = "x", compare = TRUE, verbose = FALSE
)

a_summary(c(TRUE, FALSE, FALSE, TRUE, TRUE), .N_row = 10, .N_col = 10)
a_summary(
  c(TRUE, FALSE, FALSE, TRUE, TRUE),
  .ref_group = c(TRUE, FALSE), .in_ref_col = TRUE, compare = TRUE
)

a_summary(rnorm(10), .N_col = 10, .N_row = 20, .var = "bla")
a_summary(rnorm(10, 5, 1), .ref_group = rnorm(20, -5, 1), .var = "bla", compare = TRUE)

[Package tern version 0.9.5 Index]