inspect_imb {inspectdf} | R Documentation
Summary and comparison of the most common levels in categorical columns
Description
For a single dataframe, summarise the most common level in each categorical column. If two dataframes are supplied, compare the most common levels of categorical features appearing in both dataframes. For grouped dataframes, summarise the levels of categorical columns in the dataframe split by group.
Usage
inspect_imb(df1, df2 = NULL, include_na = FALSE)
Arguments
df1 | A dataframe.
df2 | An optional second dataframe for comparing columnwise imbalance. Defaults to NULL.
include_na | Logical flag, whether to include missing values as a unique level. Default is FALSE.
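The effect of treating missing values as a level can be illustrated in base R. This is only a sketch of the idea behind include_na, not the package's own implementation; the vector x and the variable names are hypothetical.

```r
# Hypothetical toy column in which NA is the most frequent entry
x <- c("a", "a", NA, NA, NA, "b")

# Missing values ignored: only the observed levels are tabulated
tab_excl <- table(x)
top_excl <- names(tab_excl)[which.max(tab_excl)]

# Missing values counted as a level of their own
tab_incl <- table(x, useNA = "ifany")
top_incl <- names(tab_incl)[which.max(tab_incl)]

print(top_excl)
print(top_incl)
```

In this toy column the most common level changes once NA is allowed to compete with the observed levels.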
Details
For a single dataframe, the tibble returned contains the columns:
- col_name, a character vector containing column names of df1.
- value, a character vector containing the most common categorical level in each column of df1.
- pcnt, the relative frequency of each column's most common categorical level expressed as a percentage.
- cnt, the number of occurrences of the most common categorical level in each column of df1.
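To make the meaning of value, cnt and pcnt concrete, the following base R sketch computes the same three quantities by hand for one column. The vector x is a hypothetical example with no missing values, and the code is an illustration rather than the package's internals.

```r
# Hypothetical toy column with no missing values
x <- c("a", "b", "a", "c", "a")

# Tabulate the levels of the column
tab <- table(x)

# Most common level, its count, and its share of the column as a percentage
value <- names(tab)[which.max(tab)]
cnt   <- as.integer(max(tab))
pcnt  <- 100 * cnt / length(x)

print(data.frame(value = value, pcnt = pcnt, cnt = cnt))
```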
For a pair of dataframes, the tibble returned contains the columns:
- col_name, a character vector containing names of the unique columns in df1 and df2.
- value, a character vector containing the most common categorical level in each column of df1.
- pcnt_1, pcnt_2, the percentage occurrence of value in the column col_name for each of df1 and df2, respectively.
- cnt_1, cnt_2, the number of occurrences of value in the column col_name for each of df1 and df2, respectively.
- p_value, the p-value associated with the null hypothesis that the true rate of occurrence is the same for both dataframes. Small values indicate stronger evidence of a difference in the rate of occurrence.
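The documentation does not state which test produces p_value. As an illustration only, a two-sample test of equal proportions from base R (prop.test) addresses the same null hypothesis; treating it as equivalent to what inspect_imb computes is an assumption. The counts below are hypothetical.

```r
# Hypothetical counts: occurrences of one level, and column lengths,
# in two dataframes (these numbers are made up for illustration)
cnt_1 <- 30; n_1 <- 100
cnt_2 <- 45; n_2 <- 100

# Base R two-sample test of equal proportions.
# Assumption: this stands in for whatever test the package applies.
test <- prop.test(x = c(cnt_1, cnt_2), n = c(n_1, n_2))
p_value <- test$p.value

print(p_value)
```

A smaller p_value here would point to stronger evidence that the level occurs at different rates in the two samples.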
For a grouped dataframe, the tibble returned is as for a single dataframe, except that the first k columns are the grouping columns, where k is the number of grouping variables. There will be as many rows in the result as there are unique combinations of the grouping variables.
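The shape of the grouped result can be sketched in base R by summarising each group separately and placing the grouping column first. The dataframe df and its column names are hypothetical, and the code only mimics the layout described above for one grouping variable and one categorical column.

```r
# Hypothetical toy data: one grouping column and one categorical column
df <- data.frame(
  grp = c("x", "x", "x", "y", "y"),
  col = c("a", "a", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# For each group, find the most common level of `col` and its count
by_group <- lapply(split(df$col, df$grp), function(v) {
  tab <- table(v)
  data.frame(
    value = names(tab)[which.max(tab)],
    cnt = as.integer(max(tab)),
    stringsAsFactors = FALSE
  )
})

# Bind the per-group summaries, with the grouping column first
result <- data.frame(
  grp = names(by_group),
  do.call(rbind, by_group),
  stringsAsFactors = FALSE
)
rownames(result) <- NULL

print(result)
```

With a single categorical column, the result has one row per group.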
Value
A tibble summarising and comparing the imbalance for each categorical column in one or a pair of dataframes.
Author(s)
Alastair Rushworth
See Also
Examples
# Load inspectdf for inspect_imb()
library(inspectdf)
# Load dplyr for starwars data & pipe
library(dplyr)
# Single dataframe summary
inspect_imb(starwars)
# Paired dataframe comparison
inspect_imb(starwars, starwars[1:20, ])
# Grouped dataframe summary
starwars %>% group_by(gender) %>% inspect_imb()