inspect_num {inspectdf} | R Documentation |
Summary and comparison of numeric columns
Description
For a single dataframe, summarise the numeric columns. If two dataframes are supplied, compare numeric columns appearing in both dataframes. For grouped dataframes, summarise numeric columns separately for each group.
Usage
inspect_num(df1, df2 = NULL, breaks = 20, include_int = TRUE)
Arguments
df1 | A dataframe. |
df2 | An optional second dataframe for comparing numeric columns. Defaults to NULL. |
breaks | Integer number of breaks used for histogram bins. Defaults to 20. |
include_int | Logical flag, whether to include integer columns in numeric summaries. Defaults to TRUE. |
Details
For a single dataframe, the tibble returned contains the columns:
- col_name, a character vector containing the column names in df1.
- min, q1, median, mean, q3, max and sd, the minimum, lower quartile, median, mean, upper quartile, maximum and standard deviation for each numeric column.
- pcnt_na, the percentage of each numeric feature that is missing.
- hist, a named list of tibbles containing the relative frequency of values falling in bins determined by breaks.
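The per-column quantities listed above can be sketched in base R for a single numeric vector. This is only an illustration, not the package's implementation: the helper name summarise_numeric is hypothetical, and the use of hist() for binning and the layout of the histogram table are assumptions.

```r
# Hypothetical helper (not part of inspectdf): summarise one numeric vector.
summarise_numeric <- function(x, breaks = 20) {
  qs <- quantile(x, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
  # Bin the non-missing values; using hist() here is an assumption.
  h <- hist(x[!is.na(x)], breaks = breaks, plot = FALSE)
  list(
    min     = min(x, na.rm = TRUE),
    q1      = qs[1],
    median  = qs[2],
    mean    = mean(x, na.rm = TRUE),
    q3      = qs[3],
    max     = max(x, na.rm = TRUE),
    sd      = sd(x, na.rm = TRUE),
    pcnt_na = 100 * mean(is.na(x)),   # percentage of missing values
    hist    = data.frame(
      lower = head(h$breaks, -1),     # left edge of each bin
      upper = tail(h$breaks, -1),     # right edge of each bin
      prop  = h$counts / sum(h$counts)  # relative frequency per bin
    )
  )
}

x <- c(1, 2, 3, 4, NA, 6, 7, 8, 9, 10)
s <- summarise_numeric(x, breaks = 5)
```

Here s$pcnt_na reports the share of missing entries as a percentage, and the prop column of s$hist sums to 1 because it holds relative rather than absolute frequencies.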
For a pair of dataframes, the tibble returned contains the columns:
- col_name, a character vector containing the column names in df1 and df2.
- hist_1, hist_2, list columns containing the histograms for df1 and df2. Where a column appears in both dataframes, the bins used for df1 are reused to calculate the histogram for df2.
- jsd, a numeric column containing the Jensen-Shannon divergence. This measures the difference in distribution of a pair of binned numeric features. Values near 0 indicate agreement of the distributions, while values near 1 indicate disagreement.
- pval, the p-value corresponding to a null hypothesis test (NHT) that the true frequencies of histogram bins are equal. A small p-value indicates evidence that the two sets of relative frequencies are actually different. The test is based on a modified Chi-squared statistic.
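As a rough illustration of the jsd column, the Jensen-Shannon divergence between two sets of bin proportions can be computed as below. This is a minimal sketch under stated assumptions, not the package's code: the helper name jsd_bins is hypothetical, base-2 logarithms are assumed so that the result lies between 0 and 1, and the shared breaks are chosen here to span both toy vectors.

```r
# Hypothetical helper: Jensen-Shannon divergence between two vectors of
# bin proportions, using log base 2 (assumed) so the result is in [0, 1].
jsd_bins <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) {
    keep <- a > 0                      # terms with a == 0 contribute 0
    sum(a[keep] * log2(a[keep] / b[keep]))
  }
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# Toy data: the same breaks are used for both vectors.
x1 <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)
x2 <- c(3, 4, 4, 5, 5, 5, 6, 6, 7)
brks <- seq(1, 7, length.out = 7)
p <- as.numeric(table(cut(x1, brks, include.lowest = TRUE))) / length(x1)
q <- as.numeric(table(cut(x2, brks, include.lowest = TRUE))) / length(x2)

d_same <- jsd_bins(p, p)               # identical distributions
d_diff <- jsd_bins(p, q)               # partially overlapping distributions
d_disjoint <- jsd_bins(c(1, 0), c(0, 1))  # no overlap at all
```

Identical proportions give a divergence of 0 and fully disjoint proportions give 1, which matches the interpretation of the jsd column described above.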
For a grouped dataframe, the tibble returned is as for a single dataframe, but where the first k columns are the grouping columns. There will be as many rows in the result as there are unique combinations of the grouping variables.
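The shape of a grouped summary, with grouping columns first and one row per group, can be sketched in base R with aggregate(). The data and column names here are hypothetical, and only a single statistic of a single column is shown.

```r
# Toy data with one grouping column (hypothetical names).
df <- data.frame(
  grp = c("a", "a", "b", "b", "b"),
  val = c(1, 3, 10, 20, 30)
)

# One summary row per unique value of the grouping column;
# the grouping column comes first in the result.
out <- aggregate(val ~ grp, data = df, FUN = mean)
```

With k = 1 grouping column and two distinct groups, out has two rows, and its first column is grp.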
Value
A tibble containing statistical summaries of the numeric columns of df1, or comparing the histograms of df1 and df2.
Author(s)
Alastair Rushworth
Examples
# Load inspectdf, plus dplyr for starwars data & pipe
library(inspectdf)
library(dplyr)
# Single dataframe summary
inspect_num(starwars)
# Paired dataframe comparison
inspect_num(starwars, starwars[1:20, ])
# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_num()