descr {DescrTab2}	R Documentation

Calculate descriptive statistics

Description

Generate a list of descriptive statistics. By default, the function calculates summary statistics such as mean, standard deviation, quantiles, minimum and maximum for continuous variables and relative and absolute frequencies for categorical variables. It also calculates p-values for an appropriately chosen statistical test. For two-group comparisons, confidence intervals for appropriate summary measures of group differences are calculated as well. In particular, Wald confidence intervals from prop.test are used for categorical variables with 2 levels, confidence intervals from t.test are used for continuous variables and confidence intervals for the Hodges-Lehmann estimator [1] from wilcox.test are used for ordinal variables.
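
As a rough illustration of where these intervals come from, the following sketch calls the three base R functions directly. It uses default arguments and hypothetical counts for the proportion example; the arguments that descr passes internally may differ.

```r
# Base R calls named in the description. Assumption: default arguments;
# descr() may call these functions with different settings.
setosa     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor <- iris$Sepal.Length[iris$Species == "versicolor"]

# Continuous variable: confidence interval for the difference in means
ci_t <- t.test(setosa, versicolor)$conf.int

# Ordinal variable: Hodges-Lehmann estimate of the location shift and its CI.
# The data contain ties, so wilcox.test warns and uses a normal approximation.
hl <- suppressWarnings(wilcox.test(setosa, versicolor, conf.int = TRUE))
hl$estimate
hl$conf.int

# Categorical variable with 2 levels: CI for a difference in proportions
# (hypothetical counts: 15 of 50 versus 25 of 50)
ci_p <- prop.test(x = c(15, 25), n = c(50, 50))$conf.int
```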

Usage

descr(
  dat,
  group = NULL,
  group_labels = list(),
  var_labels = list(),
  var_options = list(),
  summary_stats_cont = list(N = DescrTab2:::.N, Nmiss = DescrTab2:::.Nmiss, mean =
    DescrTab2:::.mean, sd = DescrTab2:::.sd, median = DescrTab2:::.median, Q1 =
    DescrTab2:::.Q1, Q3 = DescrTab2:::.Q3, min = DescrTab2:::.min, max =
    DescrTab2:::.max),
  summary_stats_numeric_ord = list(N = DescrTab2:::.factorN, Nmiss =
    DescrTab2:::.factorNmiss, mean = DescrTab2:::.factormean, sd = DescrTab2:::.factorsd,
    median = DescrTab2:::.factormedian, Q1 = DescrTab2:::.factorQ1, Q3 =
    DescrTab2:::.factorQ3, min = DescrTab2:::.factormin, max = DescrTab2:::.factormax),
  summary_stats_cat = list(),
  format_summary_stats = list(
    N = function(x) { format(x, digits = 2, scientific = 3) },
    mean = function(x) { format(x, digits = 2, scientific = 3) },
    sd = function(x) { format(x, digits = 2, scientific = 3) },
    median = function(x) { format(x, digits = 2, scientific = 3) },
    Q1 = function(x) { format(x, digits = 2, scientific = 3) },
    Q3 = function(x) { format(x, digits = 2, scientific = 3) },
    min = function(x) { format(x, digits = 2, scientific = 3) },
    max = function(x) { format(x, digits = 2, scientific = 3) },
    CI = function(x) { format(x, digits = 2, scientific = 3) }
  ),
  format_p = scales::pvalue_format(),
  format_options = list(print_Total = NULL, print_p = TRUE, print_CI = TRUE,
    combine_mean_sd = FALSE, combine_median_Q1_Q3 = FALSE, omit_factor_level = "none",
    omit_Nmiss_if_0 = TRUE, omit_missings_in_group = TRUE, percent_accuracy = NULL,
    percent_suffix = "%", row_percent = FALSE, Nmiss_row_percent = FALSE,
    absolute_relative_frequency_mode = c("both", "only_absolute", "only_relative"),
    omit_missings_in_categorical_var = FALSE, categorical_missing_percent_mode =
    c("no_missing_percent", "missing_as_regular_category",
    "missing_as_separate_category"), caption = NULL, replace_empty_string_with_NA = TRUE,
    categories_first_summary_stats_second = FALSE, max_first_col_width = 7.5),
  test_options = list(paired = FALSE, nonparametric = FALSE, exact = FALSE, var_equal =
    FALSE, indices = c(), guess_id = FALSE, include_group_missings_in_test = FALSE,
    include_categorical_missings_in_test = FALSE, test_override = NULL,
    additional_test_args = list(), boschloo_max_n = 200),
  reshape_rows = list(
    `Q1 - Q3` = list(args = c("Q1", "Q3"), fun = function(Q1, Q3) {
      paste0(Q1, " -- ", Q3)
    }),
    `min - max` = list(args = c("min", "max"), fun = function(min, max) {
      paste0(min, " -- ", max)
    })
  ),
  ...
)

Arguments

dat

Data frame or tibble. The data set to be analyzed. Can contain continuous or factor (also ordered) variables.

group

name (as character) of the group variable in dat.

group_labels

named list of labels for the levels of the group variable in dat.

var_labels

named list of variable labels.

var_options

named list of lists. For each variable, you can have special options that apply only to that variable. These options are specified in this argument. See the details and examples for more explanation.

summary_stats_cont

named list of summary statistic functions to be used for numeric variables.

summary_stats_numeric_ord

named list of summary statistic functions to be used for ordered factor variables which can be converted to numeric.

summary_stats_cat

named list of summary statistic functions to be used for categorical variables.

format_summary_stats

named list of formatting functions for summary statistics.

format_p

formatting function for p-values.

format_options

named list of formatting options.

test_options

named list of test options.

reshape_rows

named list of lists. Describes how to combine different summary statistics into the same row.

...

further arguments to be passed along.

Value

Returns a DescrList object: a named list of descriptive statistics that can be passed to the print function to create formatted summary tables.

Labels

group_labels and var_labels need to be named lists of character elements. For var_labels, the names of the list elements have to match the variable names in your dataset; for group_labels, they have to match the levels of the group variable. The values of the list elements are the labels that will be displayed in their place when printing.
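
A minimal sketch of both label lists for the iris data follows. The label texts are made up for illustration, and the descr call is only run if DescrTab2 is installed.

```r
# Names of var_labels match column names of iris; names of group_labels
# match the levels of the group variable "Species".
# The label strings themselves are illustrative.
var_labels <- list(
  Sepal.Length = "Sepal length (cm)",
  Sepal.Width  = "Sepal width (cm)"
)
group_labels <- list(
  setosa     = "I. setosa",
  versicolor = "I. versicolor",
  virginica  = "I. virginica"
)

# Sanity checks that can be run before calling descr()
stopifnot(all(names(var_labels) %in% names(iris)))
stopifnot(all(names(group_labels) %in% levels(iris$Species)))

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  labelled_tab <- DescrTab2::descr(
    iris, "Species",
    group_labels = group_labels,
    var_labels   = var_labels
  )
}
```

Variables without an entry in var_labels are left out of the list on purpose here; only the two sepal measurements receive custom labels.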

Custom summary statistics

summary_stats_cont, summary_stats_numeric_ord and summary_stats_cat are all named lists of functions. The names of the list elements are what will be displayed in the leftmost column of the descriptive table. Each function should take a vector and return a single value.
Each summary statistic has to have an associated formatting function in the format_summary_stats list. The functions in format_summary_stats take a numeric value and convert it to a character string, e.g. 0.2531235 -> "0.2".
The format_p function converts p-values to character strings, e.g. 0.05 -> "0.05" or 0.000001 -> "<0.001".
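
The following sketch pairs a custom statistic with a formatting function. The coefficient of variation (cv), the list names my_stats and my_formats, and the chosen number formats are all hypothetical; the set of entries mirrors the defaults shown under Usage, where format_summary_stats has a CI entry but no Nmiss entry.

```r
# Hypothetical custom statistic: coefficient of variation
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Summary statistics: each function takes a vector and returns one value
my_stats <- list(
  N     = function(x) sum(!is.na(x)),
  Nmiss = function(x) sum(is.na(x)),
  mean  = function(x) mean(x, na.rm = TRUE),
  CV    = cv
)

# Formatting functions: each takes a numeric value and returns a string
my_formats <- list(
  N    = function(x) format(x, digits = 2, scientific = 3),
  mean = function(x) formatC(x, digits = 1, format = "f"),
  CV   = function(x) paste0(formatC(100 * x, digits = 1, format = "f"), "%"),
  CI   = function(x) format(x, digits = 2, scientific = 3)
)

# The pairing can be tried out on a single column without descr()
cv_text <- my_formats$CV(my_stats$CV(iris$Sepal.Length))

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  custom_tab <- DescrTab2::descr(
    iris,
    summary_stats_cont   = my_stats,
    format_summary_stats = my_formats
  )
}
```

Checking each statistic and its formatter on one column first makes it easier to spot a function that returns more than one value or a formatter that does not return a character string.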

Formatting options

Further formatting options can be specified in the format_options list. Its members and their default values are shown in the format_options argument under Usage.

Test options

test_options is a named list with test options. Its members paired, nonparametric, and exact (logicals) control which test is used in the corresponding situation. For details, check out the vignette: https://imbi-heidelberg.github.io/DescrTab2/articles/b_test_choice_tree_pdf.pdf.
The test_options = list(test_override="<some test name>") option can be specified to force usage of a specific test. This will produce errors if the data does not allow calculation of that specific test, so be wary. Use print_test_names() to see a list of all available test names.
If paired = TRUE is specified, you need to supply an index variable indices that specifies which datapoints in your dataset are paired. indices may either be a length one character vector that gives the name of the index variable in your dataset, or a vector containing the respective indices. If guess_id is set to TRUE (the default shown under Usage is FALSE), DescrTab2 will try to guess the ID variable from your dataset and report a warning if it succeeds. See https://imbi-heidelberg.github.io/DescrTab2/articles/a_usage_guide.html#Paired-observations-1 for a bit more explanation.
The optional list additional_test_args can be used to pass arguments along to test functions, e.g. additional_test_args=list(correct=TRUE) will request continuity correction if available.
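
As a sketch of how these options might be assembled, the example below builds a nonparametric option list and a small paired dataset. The data frame paired_dat and the vector subject_id are simulated for illustration, and the descr calls are only run if DescrTab2 is installed.

```r
# Request nonparametric tests for all variables
opts_nonparametric <- list(nonparametric = TRUE)

# Simulated paired data (hypothetical): 10 subjects measured twice
set.seed(1)
n <- 10
paired_dat <- data.frame(
  time  = factor(rep(c("before", "after"), each = n),
                 levels = c("before", "after")),
  value = c(rnorm(n, mean = 10), rnorm(n, mean = 11))
)
# One index per row; equal indices mark observations of the same subject
subject_id <- rep(seq_len(n), times = 2)
opts_paired <- list(paired = TRUE, indices = subject_id)

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  tab_nonparametric <- DescrTab2::descr(iris, "Species",
                                        test_options = opts_nonparametric)
  tab_paired <- DescrTab2::descr(paired_dat, "time",
                                 test_options = opts_paired)
}
```

Here indices is supplied as a vector rather than as a column name, so the identifier does not have to be part of the data frame that is summarized.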

Customization for single variables

The var_options list can be used to conduct customizations that should only apply to a single variable and leave the rest of the table unchanged.
var_options is a list of named lists. This means that each member of var_options is itself a list again. The names of the list elements of var_options determine the variables to which the options will apply. Let's say you have an age variable in your dataset. To change 'descr' options only for age, you will need to pass a list of the form var_options = list(age = list(<Your options here>)).
You can replace <Your options here> with the following options:

Combining rows

The reshape_rows argument offers a framework for combining multiple rows of the output table into a single one. reshape_rows is a named list of lists. The names of its member lists determine the name that will be displayed for the combined summary statistics in the table (e.g. "mean ± sd"). Each member list needs to contain two elements: args, which contains the names of the summary statistics to be combined as character strings, and fun, which contains a function that combines these summary statistics. The argument names of this function need to match the character strings specified in args. Check out the default options for an exemplary definition.
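
A minimal sketch of one reshape_rows entry, modelled on the default definitions shown under Usage. The row name "mean (sd)" and the list name my_reshape are illustrative, and it is an assumption that a user-supplied list replaces the default entries rather than extending them.

```r
# One combined row: the list name becomes the row label, args names the
# statistics to combine, and fun has arguments with exactly those names.
my_reshape <- list(
  `mean (sd)` = list(
    args = c("mean", "sd"),
    fun  = function(mean, sd) {
      paste0(mean, " (", sd, ")")
    }
  )
)

# The argument names of fun have to match args
entry <- my_reshape[["mean (sd)"]]
stopifnot(identical(names(formals(entry$fun)), entry$args))

# fun receives already formatted values and returns the combined string
combined <- entry$fun(mean = "5.8", sd = "0.83")

if (requireNamespace("DescrTab2", quietly = TRUE)) {
  reshaped_tab <- DescrTab2::descr(iris, reshape_rows = my_reshape)
}
```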

References

[1] Hodges, J. L.; Lehmann, E. L. (1963). "Estimation of location based on ranks". Annals of Mathematical Statistics. 34 (2): 598-611. doi:10.1214/aoms/1177704172. JSTOR 2238406. MR 0152070. Zbl 0203.21105. PE euclid.aoms/1177704172

Examples

descr(iris)
DescrList <- descr(iris)
DescrList$variables$results$Sepal.Length$Total$mean
print(DescrList)
descr(iris, "Species")

[Package DescrTab2 version 2.1.16 Index]