utils_stats {metan} | R Documentation |
Useful functions for computing descriptive statistics
Description
- The following functions quickly compute descriptive statistics by levels of a factor or combination of factors.
  - cv_by(): computes the coefficient of variation.
  - max_by(): computes maximum values.
  - mean_by(): computes arithmetic means.
  - min_by(): computes minimum values.
  - n_by(): gets the length.
  - sd_by(): computes the sample standard deviation.
  - var_by(): computes the sample variance.
  - sem_by(): computes the standard error of the mean.
- Useful functions for descriptive statistics. All of them work naturally with %>%, handle grouped data, and accept multiple variables (all numeric variables from .data by default).
  - av_dev(): computes the average absolute deviation.
  - ci_mean_t(): computes the t-interval for the mean.
  - ci_mean_z(): computes the z-interval for the mean.
  - cv(): computes the coefficient of variation.
  - freq_table(): computes a frequency table for either numeric or categorical/discrete data. For numeric data, it is possible to define the number of classes to be generated.
  - hmean(), gmean(): compute the harmonic and geometric means, respectively. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals. The geometric mean is the nth root of the product of the n values.
  - kurt(): computes the kurtosis as used in SAS and SPSS.
  - range_data(): computes the range of the values.
  - n_valid(): the number of valid (not NA) values.
  - n_unique(): the number of unique values.
  - n_missing(): the number of missing values.
  - row_col_mean(), row_col_sum(): add a row with the mean/sum of each variable and a column with the mean/sum of each row of the data.
  - sd_amo(), sd_pop(): compute the sample and population standard deviation, respectively.
  - sem(): computes the standard error of the mean.
  - skew(): computes the skewness as used in SAS and SPSS.
  - ave_dev(): computes the average of the absolute deviations.
  - sum_dev(): computes the sum of the absolute deviations.
  - sum_sq(): computes the sum of the squared values.
  - sum_sq_dev(): computes the sum of the squared deviations.
  - var_amo(), var_pop(): compute the sample and population variance, respectively.
- desc_stat() is a wrapper function around the above ones and can be used to quickly compute all these statistics at once.
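The definitions given above for the harmonic mean, the geometric mean, and the sample versus population standard deviation can be checked with a few lines of base R. This is only a minimal sketch of the textbook formulas, not the code used by metan; the names hmean_sketch(), gmean_sketch(), sd_population and sem_sketch are hypothetical and exist only for this illustration.

```r
# Base R sketch of some definitions listed above (not metan's source code).
x <- c(1, 2, 4, 8)

# Harmonic mean: reciprocal of the arithmetic mean of the reciprocals.
hmean_sketch <- function(x) 1 / mean(1 / x)

# Geometric mean: nth root of the product of the n values.
gmean_sketch <- function(x) prod(x)^(1 / length(x))

# Sample standard deviation (n - 1 denominator), as returned by sd().
sd_sample <- sd(x)

# Population standard deviation (n denominator).
sd_population <- sqrt(mean((x - mean(x))^2))

# Standard error of the mean: sample standard deviation over sqrt(n).
sem_sketch <- sd_sample / sqrt(length(x))

hmean_sketch(x)
gmean_sketch(x)
c(sample = sd_sample, population = sd_population, sem = sem_sketch)
```

For positive values the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean, and the population standard deviation is always smaller than the sample one for the same data.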
Usage
av_dev(.data, ..., na.rm = FALSE)
ci_mean_t(.data, ..., na.rm = FALSE, level = 0.95)
ci_mean_z(.data, ..., na.rm = FALSE, level = 0.95)
cv(.data, ..., na.rm = FALSE)
freq_table(.data, var, k = NULL, digits = 3)
freq_hist(
  table,
  xlab = NULL,
  ylab = NULL,
  fill = "gray",
  color = "black",
  ygrid = TRUE
)
hmean(.data, ..., na.rm = FALSE)
gmean(.data, ..., na.rm = FALSE)
kurt(.data, ..., na.rm = FALSE)
n_missing(.data, ..., na.rm = FALSE)
n_unique(.data, ..., na.rm = FALSE)
n_valid(.data, ..., na.rm = FALSE)
pseudo_sigma(.data, ..., na.rm = FALSE)
range_data(.data, ..., na.rm = FALSE)
row_col_mean(.data, na.rm = FALSE)
row_col_sum(.data, na.rm = FALSE)
sd_amo(.data, ..., na.rm = FALSE)
sd_pop(.data, ..., na.rm = FALSE)
sem(.data, ..., na.rm = FALSE)
skew(.data, ..., na.rm = FALSE)
sum_dev(.data, ..., na.rm = FALSE)
ave_dev(.data, ..., na.rm = FALSE)
sum_sq_dev(.data, ..., na.rm = FALSE)
sum_sq(.data, ..., na.rm = FALSE)
var_pop(.data, ..., na.rm = FALSE)
var_amo(.data, ..., na.rm = FALSE)
cv_by(.data, ..., .vars = NULL, na.rm = FALSE)
max_by(.data, ..., .vars = NULL, na.rm = FALSE)
min_by(.data, ..., .vars = NULL, na.rm = FALSE)
means_by(.data, ..., .vars = NULL, na.rm = FALSE)
mean_by(.data, ..., .vars = NULL, na.rm = FALSE)
n_by(.data, ..., .vars = NULL, na.rm = FALSE)
sd_by(.data, ..., .vars = NULL, na.rm = FALSE)
var_by(.data, ..., .vars = NULL, na.rm = FALSE)
sem_by(.data, ..., .vars = NULL, na.rm = FALSE)
sum_by(.data, ..., .vars = NULL, na.rm = FALSE)
Arguments
.data |
A data frame or a numeric vector. |
... |
The argument depends on the function used. |
na.rm |
If TRUE, missing values are removed before the statistic is computed. Defaults to FALSE. |
level |
The confidence level for the confidence interval of the mean. Defaults to 0.95. |
var |
The variable for which the frequency table is computed. See Details. |
k |
The number of classes to be created. See Details. |
digits |
The number of significant figures to show. Defaults to 3. |
table |
A frequency table computed with freq_table(). |
xlab , ylab |
The labels of the x and y axes, respectively. |
fill , color |
The color to fill the bars and color the border of the bar, respectively. |
ygrid |
Shows grid lines on the y axis. Defaults to TRUE. |
.vars |
Used to select variables in the *_by() functions. Defaults to NULL. |
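To make the role of the level argument concrete, the following base R sketch computes t- and z-based half-widths for a confidence interval of the mean. It is an illustration under stated assumptions only: the data vector is made up, the object names are hypothetical, and this page does not say whether ci_mean_t() and ci_mean_z() return the half-width or the interval limits.

```r
# Hypothetical data, used only for illustration.
x <- c(4.1, 5.3, 4.8, 5.9, 5.0)
level <- 0.95

n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean
p <- 1 - (1 - level) / 2         # upper-tail probability for a two-sided interval

# Half-width based on Student's t with n - 1 degrees of freedom.
half_t <- qt(p, df = n - 1) * se

# Half-width based on the standard normal distribution.
half_z <- qnorm(p) * se

# Limits of the t-interval for the mean.
ci_t <- c(lower = mean(x) - half_t, upper = mean(x) + half_t)
ci_t
```

The limits in ci_t agree with the interval reported by t.test(x, conf.level = level), and the t-based half-width is always wider than the z-based one for the same data.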
Details
The function freq_table() computes a frequency table for either numerical or categorical variables. If a variable is categorical or discrete (integer values), the number of classes will be the number of levels that the variable contains.
If a variable (say, data) is continuous, the number of classes (k) is given by the square root of the number of samples (n) if n <= 100, or by 5 * log10(n) if n > 100.
The amplitude (\(A\)) of the data is used to define the size of the class (\(c\)), given by
\[c = \frac{A}{k - 1}\]
The lower limit of the first class (LL1) is given by min(data) - c / 2, and its upper limit by LL1 + c. With this class size, the k classes span from min(data) - c / 2 to max(data) + c / 2. The limits of the other classes are obtained in the same way, each class starting where the previous one ends. After the classes are created, the absolute and relative frequencies within each class are computed.
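The class construction described in this section can be sketched in base R as follows. This is an illustration, not the freq_table() source. It makes three assumptions: the number of classes is rounded to the nearest integer, the class size is the amplitude divided by the number of classes minus one (so that the classes cover all the data), and classes are closed on the left. All object names are hypothetical.

```r
set.seed(1)
x <- rnorm(200, 10, 1)   # a continuous variable with n = 200
n <- length(x)

# Number of classes: sqrt(n) if n <= 100, otherwise 5 * log10(n).
# Rounding to the nearest integer is an assumption of this sketch.
k <- if (n <= 100) round(sqrt(n)) else round(5 * log10(n))

A <- diff(range(x))      # amplitude of the data
c_size <- A / (k - 1)    # class size (assumed denominator: k - 1)

# Lower limit of the first class, then k + 1 equally spaced breaks.
ll1 <- min(x) - c_size / 2
breaks <- ll1 + c_size * (0:k)

# Absolute and relative frequencies within each class.
classes <- cut(x, breaks = breaks, right = FALSE)  # left-closed: assumption
abs_freq <- table(classes)
rel_freq <- prop.table(abs_freq)

data.frame(class = names(abs_freq),
           abs_freq = as.vector(abs_freq),
           rel_freq = round(as.vector(rel_freq), 3))
```

With n = 200 the rule gives 12 classes, and the last break falls half a class size above the maximum, so every observation is assigned to a class.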
Value
- Functions *_by() return a tbl_df with the computed statistics for each level of the factor(s) declared in the ... argument. All other functions return a named integer if the input is a data frame or a numeric value if the input is a numeric vector.
- freq_table() returns a list with the frequency table and the breaks used for class definition. These breaks can be used to construct a histogram of the variable.
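As a rough analogue of the shape returned by the *_by() functions (one row per factor level, one column per summarised variable), the base R sketch below computes group means with aggregate() on the built-in iris data set. It illustrates the kind of result only; it returns a plain data frame rather than a tbl_df, and it does not reproduce how metan computes it.

```r
# Mean of every numeric variable for each level of Species (base R only).
means_by_species <- aggregate(. ~ Species, data = iris, FUN = mean)

# One row per level of the grouping factor, one column per variable.
means_by_species
dim(means_by_species)
```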
Author(s)
Tiago Olivoto tiagoolivoto@gmail.com
References
Ferreira, Daniel Furtado. 2009. Estatistica Basica. 2 ed. Vicosa, MG: UFLA.
Examples
library(metan)
# Means of all numeric variables by GEN and ENV
mean_by(data_ge2, GEN, ENV)
# Coefficient of variation for all numeric variables
# by GEN and ENV
cv_by(data_ge2, GEN, ENV)
# Skewness of a numeric vector
set.seed(1)
nvec <- rnorm(200, 10, 1)
skew(nvec)
# 0.95 confidence interval for the mean
# All numeric variables
# Grouped by levels of ENV
data_ge2 %>%
  group_by(ENV) %>%
  ci_mean_t()
# Standard error of the mean
# Variables PH and EH
sem(data_ge2, PH, EH)
# Frequency table for variable NR
data_ge2 %>%
  freq_table(NR)