tabl {labelr} | R Documentation |
Construct Value Label-Friendly Frequency Tables
Description
tabl
calculates raw or weighted frequency counts (or proportions) over
arbitrary categorical values (including integer values), which may be
expressed in terms of raw variable values or labelr label values.
Usage
tabl(
data,
vars = NULL,
labs.on = TRUE,
qtiles = 4,
prop.digits = NULL,
wt = NULL,
div.by = NULL,
max.unique.vals = 10,
sort.freq = TRUE,
zero.rm = FALSE,
irreg.rm = FALSE,
wide.col = NULL
)
Arguments
data |
a data.frame. |
vars |
a quoted character vector of the names of the variables whose values
define the category groups to tabulate over in the table. If NULL
|
labs.on |
if TRUE (the default), then value labels – rather than the raw variable values – will be displayed in the returned table for any value-labeled variables. Variables need not be value-labeled: This command (with this option set to TRUE or FALSE) will work even when no variables are value-labeled. |
qtiles |
if not NULL, must be a 1L integer between 2 and 100 indicating the number of quantile categories to employ in temporarily (for purposes of tabulation) auto-value-labeling numeric columns that exceed the max.unique.vals threshold. If NULL, no such auto-value-labeling will take place. Note: When labs.on = TRUE, any pre-existing variable value labels will be used in favor of the quantile value labels generated by this argument. By default, qtiles = 4, and the automatically generated quantile category levels will be labeled as "q025" (i.e., first quartile), "q050", "q075", and "q100". |
prop.digits |
if non-NULL, cell proportions will be returned instead of frequency counts, and these will be rounded to the number of decimal digits specified (e.g., prop.digits = 3 means a value of 0.157 would be returned for a cell that accounted for 8 observations if the total number of observations were 51). If NULL (the default), frequency counts will be returned. |
wt |
an optional vector that includes cell counts or some other idiosyncratic "importance" weight. If NULL, no weighting will be employed. |
div.by |
Divide the returned counts by a constant for scaling purposes. This may be a number (e.g., div.by = 10 to divide by 10) or a character that follows the convention "number followed by 'K', 'M', or 'B'", where, e.g., "10K" is translated as 10000, "1B" is translated as 1000000000, etc. |
max.unique.vals |
Integer to specify the maximum number of unique values of a variable that may be observed for that variable to be included in tabulations. Note that labelr sets a hard ceiling of 5000 on the total number of unique value labels that any variable is permitted to have under any circumstance, as labelr is primarily intended for interactive use with moderately-sized data.frames. See the qtiles argument for an approach to incorporating many-valued numeric variables that exceed the max.unique.vals threshold. |
sort.freq |
By default, returned table rows are sorted in descending order of cell frequency (most frequent categories/combinations first). If set to FALSE, table rows will be sorted by the distinct values of the vars (in the order vars are specified). |
zero.rm |
If TRUE, zero-frequency vars categories/combinations (i.e., those not observed in the data.frame) will be filtered from the table. For tables that would produce more than 10000 rows, this is done automatically. |
irreg.rm |
If TRUE, tabulations exclude cases where any applicable variable (see vars argument) features any of the following "irregular" values: NA, NaN, Inf, -Inf, or any case-insensitive variation on "NA", "NAN", "INF", or "-INF". If FALSE, all "irregular" values (as just defined) are assigned to a "catch-all" category of NA that is featured in the returned table (if/where present). |
wide.col |
If non-NULL, this is the quoted name of a single column / var of supplied data.frame whose distinct values (category levels) you wish to be columns of the returned table. For example, if you are interested in a cross-tab of "edu" (highest level of education) and "race" (a race/ethnicity variable), you could supply vars= c("edu") and wide.col = "race", and the different racial-ethnic group categories would appear as distinct columns, with "edu" category levels appearing as distinct rows, and cell values representing the cross-tabbed cell "edu" level frequencies for the respective "race" groups (see examples). You may supply one wide.col at most. |
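The div.by convention and the prop.digits arithmetic described above can be illustrated with a short base R sketch. The helper parse_div_by() is hypothetical and is not part of labelr; it only mirrors the documented "number followed by 'K', 'M', or 'B'" convention, and labelr's internal handling may differ.

```r
# Hypothetical helper (not part of labelr): translate a div.by-style value
# such as 10, "10K", "1M", or "1B" into a plain numeric divisor.
parse_div_by <- function(x) {
  if (is.numeric(x)) {
    return(x)
  }
  x <- toupper(trimws(x))
  # multipliers for the suffixes named in the div.by description
  mult <- c(K = 1e3, M = 1e6, B = 1e9)
  suffix <- substr(x, nchar(x), nchar(x))
  if (suffix %in% names(mult)) {
    as.numeric(substr(x, 1, nchar(x) - 1)) * mult[[suffix]]
  } else {
    as.numeric(x)
  }
}

div_10k <- parse_div_by("10K") # ten thousand
div_1b <- parse_div_by("1B")   # one billion
div_num <- parse_div_by(10)    # numeric input is returned unchanged

# prop.digits arithmetic from the argument description:
# 8 observations out of 51, rounded to 3 decimal digits
prop_example <- round(8 / 51, digits = 3)
```
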
Details
This function creates a labelr-friendly data.frame representation of multi-variable tabular data, where either value labels or values can be displayed (use of value labels is the default), and where various convenience options are provided, such as using frequency weights, using proportions instead of counts, rounding those proportions, automatically expressing many-valued, non-value-labeled numerical variables in terms of quantile category groups, or pivoting / casting one of the categorical variables' levels (labels) to serve as columns in a cross-tab-like table.
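The quantile-category idea mentioned above can be approximated in base R as follows. This is only a sketch of the concept, assuming the quantile cut points are distinct (as they are for mtcars$mpg); it is not labelr's implementation, and the object names (probs, breaks, q_labs, mpg_q) are illustrative.

```r
# Sketch only: bin a many-valued numeric vector into quartile categories
# and label them in the style described for qtiles = 4.
qtiles <- 4
x <- mtcars$mpg

# cut points at the 0%, 25%, 50%, 75%, and 100% quantiles
probs <- seq(0, 1, length.out = qtiles + 1)
breaks <- quantile(x, probs = probs, na.rm = TRUE)

# labels "q025", "q050", "q075", "q100"
q_labs <- sprintf("q%03d", as.integer(round(100 * probs[-1])))

# assign each value to its quantile category (lowest value included)
mpg_q <- cut(x, breaks = breaks, labels = q_labs, include.lowest = TRUE)

# frequency counts per quantile category
table(mpg_q)
```
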
Value
a data.frame.
Examples
# assign mtcars to new data.frame df
df <- mtcars
# add na values to make things interesting
df[1, 1:11] <- NA
rownames(df)[1] <- "Missing Car"
# add value labels
df <- add_val_labs(
data = df,
vars = "am",
vals = c(0, 1),
labs = c("automatic", "manual")
)
df <- add_val_labs(
data = df,
vars = "carb",
vals = c(1, 2, 3, 4, 6, 8),
labs = c(
"1-carb", "2-carbs",
"3-carbs", "4-carbs",
"6-carbs", "8-carbs"
)
)
# var arg can be unquoted if using add_val1()
# note that this is not add_val_labs(); add_val1() has "var" arg instead of "vars"
df <- add_val1(
data = df,
var = cyl, # note, "var," not "vars" arg
vals = c(4, 6, 8),
labs = c(
"four-cyl",
"six-cyl",
"eight-cyl"
)
)
df <- add_val_labs(
data = df,
vars = "gear",
vals = 3:5,
labs = c(
"3-speed",
"4-speed",
"5-speed"
)
)
# lookup mapping
get_val_labs(df)
# introduce other "irregular" values
df$am[1] <- NA
df[2, "am"] <- NaN
df[3, "am"] <- -Inf
df[5, "cyl"] <- "NAN"
# take a look
head(df)
# demonstrate tabl() frequency tabulation function
# this is the "first call" that will be referenced repeatedly below
# labels on, sort by variable values, suppress/exclude NA/irregular values
# ...return counts
tabl(df,
vars = c("cyl", "am"),
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = NULL
) # return counts, not proportions
# same as "first call", except now value labels are off
tabl(df,
vars = c("cyl", "am"),
labs.on = FALSE, # use variable values
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = NULL
) # return counts, not proportions
# same as "first call," except now proportions instead of counts
tabl(df,
vars = c("cyl", "am"),
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = 3
) # return proportions, rounded to 3rd decimal
# same as "first call," except now sort by frequency counts
tabl(df,
vars = c("cyl", "am"),
labs.on = TRUE, # use variable value labels
sort.freq = TRUE, # sort in order of descending frequency
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = NULL
) # return counts, not proportions
# same as "first call," except now use weights
set.seed(2944) # for reproducibility
df$freqwt <- sample(10:50, nrow(df), replace = TRUE) # create (fake) freq wts
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL
) # return counts, not proportions
df$freqwt <- NULL # we don't need this anymore
# now, with extremely large weights to illustrate div.by
set.seed(428441) # for reproducibility
df$freqwt <- sample(1000000:10000000, nrow(df), replace = TRUE) # large freq wts
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL
) # return counts, not proportions
# show div by - Millions
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL, # return counts, not proportions
div.by = "1M"
) # one million
# show div by - Tens of millions
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL, # return counts, not proportions
div.by = "10M"
) # ten million
# show div by - 10000
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL, # return counts, not proportions
div.by = 10000
) # ten thousand; could've used div.by = "10K"
# show div by - 10000, but different syntax
tabl(df,
vars = c("cyl", "am"),
wt = "freqwt", # use frequency weights
labs.on = TRUE, # use variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = FALSE, # NAs and the like are included/shown
prop.digits = NULL, # return counts, not proportions
div.by = "10K"
) # ten thousand; could've used div.by = 10000
df$freqwt <- NULL # we don't need this anymore
# turn labels off, to make this more compact
# do not show zero values (zero.rm)
# do not show NA values (irreg.rm)
# many-valued numeric variables will be converted to quantile categories by
# ...qtiles argument
tabl(df,
vars = c("am", "gear", "carb", "mpg"),
qtiles = 4, # many-valued numerics converted to quantile
labs.on = FALSE, # use values, not variable value labels
sort.freq = FALSE, # sort by vars values (not frequencies)
irreg.rm = TRUE, # NAs and the like are suppressed
zero.rm = TRUE, # variable combinations that never occur are suppressed
prop.digits = NULL, # return counts, not proportions
max.unique.vals = 10
) # drop from table any var with >10 distinct values
# same as above, but sort by frequency and include NA/irregular category values;
# zero.rm is FALSE: include unobserved (zero-count) category combinations
tabl(df,
vars = c("am", "gear", "carb", "mpg"),
qtiles = 4,
labs.on = FALSE, # use values, not variable value labels
sort.freq = TRUE, # sort by frequency
irreg.rm = FALSE, # preserve/include NAs and irregular values
zero.rm = FALSE, # include non-observed combinations
prop.digits = NULL, # return counts, not proportions
max.unique.vals = 10
) # drop from table any var with >10 distinct values
# show cross-tab view with wide.col arg
tabl(df,
vars = c("cyl", "am"),
labs.on = TRUE, # use variable value labels
sort.freq = TRUE, # sort in order of descending frequency
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = NULL, # return counts, not proportions
wide.col = "am"
) # use "am" as a column variable in a cross-tab view
tabl(df,
vars = c("cyl", "am"),
labs.on = TRUE, # use variable value labels
sort.freq = TRUE, # sort in order of descending frequency
irreg.rm = TRUE, # NAs and the like are suppressed
prop.digits = NULL, # return counts, not proportions
wide.col = "cyl"
) # use "cyl" as a column variable in a cross-tab view
# verify select counts using base::subset()
nrow(subset(df, am == 0 & cyl == 4))
nrow(subset(df, am == 0 & cyl == 8))
nrow(subset(df, am == 1 & cyl == 8))
nrow(subset(df, am == 0 & cyl == 6))
nrow(subset(df, am == 1 & cyl == 6))
# will work on an un-labeled data.frame
tabl(mtcars, vars = c("am", "gear", "carb", "mpg"))
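As a cross-check on the un-labeled case, the same kind of cyl-by-am cross-tab can be built with base R alone. This sketch uses the unmodified mtcars data (not the df altered above, which contains irregular values), so its counts are not expected to match the earlier df-based calls; the names xt and xt_long are illustrative.

```r
# cross-tabulate cyl (rows) by am (columns) on the unmodified mtcars data
xt <- table(cyl = mtcars$cyl, am = mtcars$am)
xt

# long-format counts, one row per cyl-am combination
xt_long <- as.data.frame(xt, responseName = "n")
xt_long
```
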