R: Useful functions for computing descriptive statistics

utils_stats {metan}

R Documentation

Description

The following functions compute descriptive statistics by levels of a factor or combination of factors quickly.
- cv_by() For computing coefficient of variation.
- max_by() For computing maximum values.
- mean_by() For computing arithmetic means.
- min_by() For computing minimum values.
- n_by() For getting the length.
- sd_by() For computing sample standard deviation.
- var_by() For computing sample variance.
- sem_by() For computing standard error of the mean.
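
To make "by levels of a factor" concrete, the sketch below computes group means with base R's aggregate(). The data frame `toy` and its values are made up for illustration, and this is only the same idea expressed in base R, not how metan implements mean_by().

```r
# Hypothetical toy data (not shipped with metan): two environments, two traits.
toy <- data.frame(
  ENV = c("E1", "E1", "E2", "E2"),
  PH  = c(2.0, 3.0, 4.0, 6.0),
  EH  = c(1.0, 1.5, 2.0, 3.0)
)

# Mean of each numeric variable within each level of ENV.
means_by_env <- aggregate(cbind(PH, EH) ~ ENV, data = toy, FUN = mean)
means_by_env
```

The result has one row per level of ENV and one column per summarized variable, which is the shape the `*_by()` functions are described as returning.
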
Useful functions for descriptive statistics. All of them work naturally with %>%, handle grouped data and multiple variables (all numeric variables from .data by default).
- av_dev() computes the average absolute deviation.
- ci_mean_t() computes the t-interval for the mean.
- ci_mean_z() computes the z-interval for the mean.
- cv() computes the coefficient of variation.
- freq_table() computes a frequency table for either numeric or categorical/discrete data. For numeric data, it is possible to define the number of classes to be generated.
- hmean(), gmean() compute the harmonic and geometric means, respectively. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. The geometric mean is the nth root of the product of the n values.
- kurt() computes the kurtosis as used in SAS and SPSS.
- range_data() Computes the range of the values.
- n_valid() The number of valid (not NA) values.
- n_unique() Number of unique values.
- n_missing() Number of missing values.
- row_col_mean(), row_col_sum() add a row with the mean/sum of each variable and a column with the mean/sum of each row of the data.
- sd_amo(), sd_pop() compute the sample and population standard deviation, respectively.
- sem() computes the standard error of the mean.
- skew() computes the skewness as used in SAS and SPSS.
- ave_dev() computes the average of the absolute deviations.
- sum_dev() computes the sum of the absolute deviations.
- sum_sq() computes the sum of the squared values.
- sum_sq_dev() computes the sum of the squared deviations.
- var_amo(), var_pop() compute the sample and population variance, respectively.
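
The definitions above can be written out in a few lines of base R. This is a sketch of the textbook formulas, not metan's source code: all helper names are hypothetical, the percentage scale for the coefficient of variation is an assumption, and the skewness and kurtosis shown are the bias-adjusted G1 and G2 coefficients that SAS and SPSS report.

```r
# Textbook formulas in base R. Helper names are hypothetical; this is not
# metan's implementation.
harmonic_mean  <- function(x) 1 / mean(1 / x)    # reciprocal of the mean reciprocal
geometric_mean <- function(x) exp(mean(log(x)))  # nth root of the product (x > 0)

# Sample statistics divide by n - 1; population statistics divide by n.
var_sample     <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
var_population <- function(x) sum((x - mean(x))^2) / length(x)
std_error      <- function(x) sqrt(var_sample(x) / length(x))
cv_percent     <- function(x) sqrt(var_sample(x)) / mean(x) * 100  # percentage assumed

# Bias-adjusted skewness (G1) and excess kurtosis (G2).
skew_g1 <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  n / ((n - 1) * (n - 2)) * sum(z^3)
}
kurt_g2 <- function(x) {
  n <- length(x)
  z <- (x - mean(x)) / sd(x)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

x <- c(2, 4, 4, 8)
harmonic_mean(x)
geometric_mean(x)
c(sample = var_sample(x), population = var_population(x))
```

The geometric mean is computed on the log scale so that long vectors do not overflow when multiplied together; it requires strictly positive values.
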

desc_stat() is a wrapper function around the above ones and can be used to quickly compute all these statistics at once.

Usage

av_dev(.data, ..., na.rm = FALSE)

ci_mean_t(.data, ..., na.rm = FALSE, level = 0.95)

ci_mean_z(.data, ..., na.rm = FALSE, level = 0.95)

cv(.data, ..., na.rm = FALSE)

freq_table(.data, var, k = NULL, digits = 3)

freq_hist(
  table,
  xlab = NULL,
  ylab = NULL,
  fill = "gray",
  color = "black",
  ygrid = TRUE
)

hmean(.data, ..., na.rm = FALSE)

gmean(.data, ..., na.rm = FALSE)

kurt(.data, ..., na.rm = FALSE)

n_missing(.data, ..., na.rm = FALSE)

n_unique(.data, ..., na.rm = FALSE)

n_valid(.data, ..., na.rm = FALSE)

pseudo_sigma(.data, ..., na.rm = FALSE)

range_data(.data, ..., na.rm = FALSE)

row_col_mean(.data, na.rm = FALSE)

row_col_sum(.data, na.rm = FALSE)

sd_amo(.data, ..., na.rm = FALSE)

sd_pop(.data, ..., na.rm = FALSE)

sem(.data, ..., na.rm = FALSE)

skew(.data, ..., na.rm = FALSE)

sum_dev(.data, ..., na.rm = FALSE)

ave_dev(.data, ..., na.rm = FALSE)

sum_sq_dev(.data, ..., na.rm = FALSE)

sum_sq(.data, ..., na.rm = FALSE)

var_pop(.data, ..., na.rm = FALSE)

var_amo(.data, ..., na.rm = FALSE)

cv_by(.data, ..., .vars = NULL, na.rm = FALSE)

max_by(.data, ..., .vars = NULL, na.rm = FALSE)

min_by(.data, ..., .vars = NULL, na.rm = FALSE)

means_by(.data, ..., .vars = NULL, na.rm = FALSE)

mean_by(.data, ..., .vars = NULL, na.rm = FALSE)

n_by(.data, ..., .vars = NULL, na.rm = FALSE)

sd_by(.data, ..., .vars = NULL, na.rm = FALSE)

var_by(.data, ..., .vars = NULL, na.rm = FALSE)

sem_by(.data, ..., .vars = NULL, na.rm = FALSE)

sum_by(.data, ..., .vars = NULL, na.rm = FALSE)

Arguments

`.data`	A data frame or a numeric vector.
`...`	The argument depends on the function used. For `*_by` functions, `...` is one or more categorical variables for grouping the data. The requested statistic is then computed for all numeric variables in the data. If no variables are supplied in `...`, the statistic is computed ignoring all non-numeric variables in `.data`. For the other functions, `...` is a comma-separated list of unquoted variable names for which to compute the statistic. If no variables are supplied in `...`, the statistic is computed for all numeric variables in `.data`.
`na.rm`	If `FALSE`, the default, missing values are removed with a warning. If `TRUE`, missing values are silently removed.
`level`	The confidence level for the confidence interval of the mean. Defaults to 0.95.
`var`	The variable to compute the frequency table. See `Details` for more details.
`k`	The number of classes to be created. See `Details` for more details.
`digits`	The number of significant figures to show. Defaults to 3.
`table`	A frequency table computed with `freq_table()`.
`xlab`, `ylab`	The `x` and `y` labels.
`fill`, `color`	The color to fill the bars and color the border of the bar, respectively.
`ygrid`	Shows a grid line on the `y` axis? Defaults to `TRUE`.
`.vars`	Used to select variables in the `*_by()` functions. One or more unquoted expressions separated by commas. Variable names can be used as if they were positions in the data frame, so expressions like `x:y` can be used to select a range of variables. Defaults to `NULL` (all numeric variables are analyzed).
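
To illustrate what the `level` argument of ci_mean_t() and ci_mean_z() controls, the following base R sketch builds the textbook t- and z-intervals for a mean. The simulated vector and all object names are made up for illustration, and the sketch makes no claim about the exact form of the values metan returns.

```r
# Textbook t- and z-intervals for a mean; not metan's implementation.
set.seed(1)
x <- rnorm(30, mean = 10, sd = 2)  # simulated data, for illustration only
level <- 0.95
alpha <- 1 - level
n <- length(x)
se <- sd(x) / sqrt(n)              # standard error of the mean

# Half-widths: Student's t with n - 1 degrees of freedom vs. standard normal.
half_t <- qt(1 - alpha / 2, df = n - 1) * se
half_z <- qnorm(1 - alpha / 2) * se

ci_t <- c(lower = mean(x) - half_t, upper = mean(x) + half_t)
ci_z <- c(lower = mean(x) - half_z, upper = mean(x) + half_z)
ci_t
ci_z
```

For any finite sample the t-interval is wider than the z-interval, and the two converge as n grows.
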

Details

The function freq_table() computes a frequency table for either numerical or categorical variables. If a variable is categorical or discrete (integer values), the number of classes will be the number of levels that the variable contains.

If a variable (say, data) is continuous, the number of classes (k) is given by the square root of the number of samples (n) if n <= 100 or 5 * log10(n) if n > 100.

The amplitude (\(A\)) of the data, that is, max(data) - min(data), is used to define the size of the class (\(c\)), given by

\[c = \frac{A}{k - 1}\]

where \(k\) is the number of classes.

The lower limit of the first class (LL1) is given by min(data) - c / 2. The upper limit is given by LL1 + c. The limits of the other classes are given in the same way. After the creation of the classes, the absolute and relative frequencies within each class are computed.
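
The class construction described above can be sketched in base R as follows. Two points are assumptions rather than facts taken from metan: the number of classes is rounded to the nearest integer, and the class size divides the amplitude by k - 1, which is the value for which k classes of width c starting at min(data) - c / 2 end at max(data) + c / 2. The simulated vector and object names are illustrative only.

```r
# Sketch of the class construction for a continuous variable (base R only).
set.seed(1)
data <- rnorm(200, mean = 10, sd = 1)  # simulated data, for illustration
n <- length(data)

# Number of classes: sqrt(n) if n <= 100, otherwise 5 * log10(n).
# ASSUMPTION: the result is rounded to the nearest integer.
k <- if (n <= 100) round(sqrt(n)) else round(5 * log10(n))

A <- max(data) - min(data)     # amplitude
c_size <- A / (k - 1)          # class size (ASSUMPTION: k - 1 in the denominator)
LL1 <- min(data) - c_size / 2  # lower limit of the first class
breaks <- LL1 + c_size * 0:k   # k + 1 limits delimiting k classes

# Absolute and relative frequencies within each class.
classes <- cut(data, breaks = breaks, right = FALSE, include.lowest = TRUE)
abs_freq <- table(classes)
rel_freq <- prop.table(abs_freq)
```

With these choices every observation falls inside exactly one class, so the absolute frequencies sum to n and the relative frequencies sum to 1.
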

Value

The *_by() functions return a tbl_df with the computed statistics for each level of the factor(s) declared in the ... argument.
All other functions return a named integer if the input is a data frame or a numeric value if the input is a numeric vector.
freq_table() returns a list with the frequency table and the breaks used for class definition. These breaks can be used to construct a histogram of the variable.

Author(s)

Tiago Olivoto tiagoolivoto@gmail.com

References

Ferreira, Daniel Furtado. 2009. Estatistica Basica. 2 ed. Vicosa, MG: UFLA.

Examples


library(metan)
# Means of all numeric variables by GEN and ENV
mean_by(data_ge2, GEN, ENV)

# Coefficient of variation for all numeric variables
# by GEN and ENV
cv_by(data_ge2, GEN, ENV)

# Skewness of a numeric vector
set.seed(1)
nvec <- rnorm(200, 10, 1)
skew(nvec)

# t-interval for the mean at the 0.95 confidence level
# All numeric variables
# Grouped by levels of ENV
data_ge2 %>%
  group_by(ENV) %>%
  ci_mean_t()

# Standard error of the mean
# Variables PH and EH
sem(data_ge2, PH, EH)

# Frequency table for variable NR
data_ge2 %>%
  freq_table(NR)

[Package metan version 1.18.0 Index]