R: Create a Pivot (Summary) Table

pivot {lessR}

R Documentation

Create a Pivot (Summary) Table

Description

Compute one or more designated descriptive statistics (compute) over one or more numerical variables (variable), either for all the data or aggregated over one or more categorical variables (by). Because the output is a two-dimensional table, select any two of the three possibilities: multiple compute functions for the descriptive statistics, multiple continuous variables over which to compute, and multiple categorical variables by which to define groups for aggregation. Displays the sample size for each group. Uses the base R function aggregate to perform the aggregation.

Usage

pivot(data, compute, variable, by=NULL, by_cols=NULL, filter=NULL,
         show_n=TRUE, na_by_show=TRUE, na_remove=TRUE, na_group_show=TRUE,
         out_names=NULL, sort=NULL, sort_var=NULL,  
         table_prop=c("none", "all", "row", "col"), table_long=FALSE,
         factors=TRUE, q_num=4, digits_d=NULL, quiet=getOption("quiet"))

Arguments

`data`	Data frame that contains the variables.
`compute`	One or more statistics, defined as one or more functions, to aggregate over the combinations of the values of the categorical variables.
`variable`	One or more numeric response variables for which to `compute` the specified statistics, perhaps aggregated, i.e., summarized across the groups.
`by`	Categorical variables that define the groups (cells) listed in the rows of the output long-form data frame, available to input into other data analysis routines. Omit to compute over the variables for all the data, e.g., the grand mean.
`by_cols`	Up to two categorical variables that define the groups displayed as columns in a two dimensional table.
`filter`	Subset, i.e., filter, rows of the input data frame for analysis.

`show_n`	By default, display the sample size and number missing for each computed summary statistic. If `FALSE`, delete all variables from the output data frame that end with `n_` or `na_`.
`na_by_show`	If `TRUE`, the default, if all values of 'variable' are missing for a group so that the entire level of the 'by' variables is missing, show those missing cells with a reported value of computed variable `n` as 0. Otherwise delete the row from the output.
`na_remove`	Sets base R parameter `na.rm`. If `TRUE`, the default, removes missing values from the `variable`(s), then reports how many values were missing. Otherwise, the aggregation statistic for a cell with any missing data returns `NA`.
`na_group_show`	If `TRUE`, the default, display `<NA>` for missing data of a grouping variable as a level for that variable. Otherwise, do not treat a missing value of a group as a level for which to aggregate.
`out_names`	Custom names for the aggregated variables. If more than one, list in the same order as specified in `variable`. Does not apply to the `table` option where the column names are the levels of the `by` variable(s).

`sort`	Set to `"+"` for an ascending sort or `"-"` for a descending sort according to the last variable in the output data frame.
`sort_var`	Either the name of the variable in the output data frame to sort, or its column number. Default is the last column.

`table_prop`	Applies to a created `table` for the value of `compute`. Default value of `"none"` leaves frequencies. Value of `"all"` converts to cell proportions based on the grand total. Values of `"row"` and `"col"` provide proportions based on row and column sums.
`table_long`	Applies to the value of `compute` of `table`. If set to `TRUE`, then the cross-tabs table is output in long form, one count per row.

`factors`	By default, `by` variables of type `character` and `integer` are converted to factors in the summary table, except for `Date` variables, which always retain their type. If `FALSE`, then the `by` variables retain their original character or integer type.
`q_num`	For the computation of quantiles, number of intervals. Default value of 4 provides quartiles.
`digits_d`	Number of significant digits for each displayed summary statistic. Trailing zeros are deleted, so, for example, integers display as integers. If not specified, defaults to 3 unless there are more than 3 decimal digits and only a single digit to the left of the decimal point. Then enough digits are displayed to capture some non-zero decimal digits to avoid rounding to 0.000. To see all digits without trailing decimal 0's, set at a large number such as 20.
`quiet`	If set to `TRUE`, no text output. Can change system default with `style` function.

Details

pivot uses base R aggregate to generate a pivot table (Excel terminology). Express multiple categorical variables over which to pivot as a vector with the c function.
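As a point of reference, the kind of aggregation described above can be sketched directly with base R aggregate. The data frame below is a hypothetical toy example, not the lessR Employee data, and the call illustrates the underlying technique only; it is not necessarily how pivot invokes aggregate internally.

```r
# Hypothetical toy data, invented for illustration
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "SALE", "SALE", "SALE", "ACCT"),
  Gender = c("M", "W", "M", "W", "W", "M"),
  Salary = c(50000, 60000, 55000, 65000, 75000, 70000)
)

# Mean Salary for each combination of Dept and Gender,
# returned as a long-form data frame with one row per cell
a <- aggregate(Salary ~ Dept + Gender, data = d, FUN = mean)
print(a)
```

The result has one row per combination of the two grouping variables, which corresponds to the long-form output described for the by parameter.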

pivot provides two features beyond those of aggregate. The first is a complete missing data analysis. If there is no missing data for the numerical variables that are aggregated, then the cell sizes are included with the aggregated data. If there is such missing data, then the amount of available data is displayed for all values to be aggregated for each cell.
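To make the missing data bookkeeping concrete, the following base R sketch computes, for each group, the number of available values, the number missing, and the mean of the available values. The data and the output column names n, miss, and mean are assumptions chosen for illustration; the sketch does not reproduce the internals of pivot.

```r
# Hypothetical toy data with missing values in the aggregated variable
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "SALE", "SALE", "SALE"),
  Salary = c(50000, NA, 55000, 65000, NA)
)

# na.action = na.pass keeps rows with NA so that FUN sees them
n_avail <- aggregate(Salary ~ Dept, data = d,
                     FUN = function(x) sum(!is.na(x)), na.action = na.pass)
n_miss  <- aggregate(Salary ~ Dept, data = d,
                     FUN = function(x) sum(is.na(x)), na.action = na.pass)
avg     <- aggregate(Salary ~ Dept, data = d,
                     FUN = mean, na.rm = TRUE, na.action = na.pass)

# Combine into one summary; column names are illustrative only
result <- data.frame(Dept = n_avail$Dept,
                     n    = n_avail$Salary,
                     miss = n_miss$Salary,
                     mean = avg$Salary)
print(result)
```

Without na.action = na.pass, the formula method of aggregate drops rows with missing values before FUN is applied, so the missing counts would all be zero.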

The second is that the data parameter is listed first in the parameter list, which facilitates the use of the pipe operator from the magrittr package. Also, there is a different interface as the by variables are specified as a vector.

Variable ranges in the specification of by are not needed in general. Only a small number of grouping variables generally define the cells for the aggregation.

The following table lists available single summary statistics. The list is not necessarily exhaustive, as the references are to functions provided by base R, which includes functions not listed below.

Statistic	Meaning
-----------	--------------------------------
`sum`	sum
`mean`	arithmetic mean
`median`	median
`min`	minimum
`max`	maximum
`sd`	standard deviation
`var`	variance
`skew`	skew
`kurtosis`	kurtosis
`IQR`	inter-quartile range
`mad`	median absolute deviation
-----------	--------------------------------

The functions skew() and kurtosis() are provided by this package as they have no counterparts in base R. All other functions are from base R.
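Because the name mad is easy to misread, the following base R sketch shows what stats::mad returns: the median of the absolute deviations from the median, scaled by a default constant of 1.4826. The vector x is a hypothetical example, and the mean absolute deviation is computed by hand only for comparison.

```r
x <- c(1, 2, 3, 4, 100)  # hypothetical data with one extreme value

# Default: median absolute deviation from the median, times 1.4826
mad_default <- mad(x)

# Unscaled median absolute deviation
mad_raw <- mad(x, constant = 1)

# Mean absolute deviation about the mean, for comparison only;
# this is not what mad() computes
mean_abs_dev <- mean(abs(x - mean(x)))

print(c(mad_default = mad_default, mad_raw = mad_raw,
        mean_abs_dev = mean_abs_dev))
```

The extreme value strongly affects the mean-based measure but barely moves the median-based one, which is the usual reason to choose mad as a robust measure of spread.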

The quantile and table statistical functions each return multiple values.

Statistic	Meaning
-----------	--------------------------------
`quantile`	min, quartiles, max
`table`	frequencies or proportions
-----------	--------------------------------

The table computation applies to an aggregated variable that consists of discrete categories, such as the numbers 1 through 5 for responses to a 5-pt Likert scale. The result is a table of frequencies or proportions, a contingency table, referred to for two or more variables as a cross-tabulation table or a joint frequency distribution. Other statistical functions can be simultaneously computed with table, though only meaningful if the aggregated variable consists of a relatively small set of discrete, numeric values.
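The frequencies and proportions described here can be sketched with base R table and prop.table. The toy data are hypothetical, and the correspondence between the table_prop values "all", "row", and "col" and the prop.table margins shown in the comments is an assumption based on the argument descriptions, not a statement about how pivot is implemented.

```r
# Hypothetical toy data
d <- data.frame(
  Dept   = c("ACCT", "ACCT", "SALE", "SALE", "SALE", "ACCT"),
  Gender = c("M", "W", "M", "W", "W", "M")
)

freq  <- table(d$Dept, d$Gender)  # joint frequencies (cross-tabulation)
p_all <- prop.table(freq)         # proportions of the grand total ("all")
p_row <- prop.table(freq, 1)      # proportions within each row ("row")
p_col <- prop.table(freq, 2)      # proportions within each column ("col")

# Long form: one count per row, similar in spirit to table_long=TRUE
long <- as.data.frame(freq)

print(freq)
print(long)
```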

The default quantiles for quantile are quartiles. Specify a custom number of quantiles with the q_num parameter, which has the default value of 4 for quartiles.
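The relation between the number of intervals and the returned quantiles can be sketched with base R quantile. The sketch assumes that q_num intervals correspond to q_num + 1 evenly spaced probabilities from 0 to 1, which matches the description of min, quartiles, and max for the default of 4; the variable names are illustrative.

```r
x <- 1:9      # hypothetical data
q_num <- 4    # number of intervals; 4 gives quartiles

# q_num intervals need q_num + 1 cut points, including min and max
probs <- seq(0, 1, length.out = q_num + 1)
q <- quantile(x, probs = probs)
print(q)

# Deciles: 10 intervals, 11 cut points
dec <- quantile(x, probs = seq(0, 1, length.out = 10 + 1))
print(dec)
```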

Value

Returns a data frame of the aggregated values, except when there are two by variables and table_2d is TRUE, in which case a table is returned.

The number of elements in each group is provided as the variable n. If a combination of by variable levels that defines a group is empty, n is set to 0 and the values of the aggregated variable are set to NA.

The number of missing elements of the value variable is provided as the variable miss.

Author(s)

David W. Gerbing (Portland State University; gerbing@pdx.edu)

Examples

library(knitr)  # for kable() called from pivot()
d <- Read("Employee", quiet=TRUE)

# parameter values named
pivot(data=d, compute=mean, variable=Salary, by=c(Dept, Gender))

# visualize the aggregation
# when reading a table of coordinates, a, BarChart cannot deal
#   with missing data, so do not show groups that are missing as
#   another level
a <- pivot(d, mean, Salary, c(Dept, Gender), na_group_show=FALSE)
BarChart(Dept, Salary_mean, by=Gender, data=a)

# calculate mean of Years and Salary for each combination of Dept and Gender
# parameter values by position
pivot(d, mean, c(Years, Salary), c(Dept, Gender))

# output as a 2-d cross-tabulation table
pivot(d, mean, Salary, Dept, Gender)

# cross-tabulation table
pivot(d, table, Dept, Gender)
# long form
pivot(d, table, Dept, Gender, table_long=TRUE)

# multiple functions for which to aggregate
pivot(d, c(mean,sd,median,IQR), Years, c(Gender,Dept), digits_d=2)

# A variety of statistics computed for several variables over the
#  entire data set without aggregation
pivot(d, c(mean,sd,skew,kurtosis), c(Years,Salary,Pre,Post), digits_d=2)

[Package lessR version 4.3.6 Index]