R: Sorts numeric from discrete variables and returns separate...

summarize {rockchalk}

R Documentation

Sorts numeric from discrete variables and returns separate summaries for those types of variables.

Description

The work is done by the functions summarizeNumerics and summarizeFactors. Please see the help pages for those functions for complete details.

Usage

summarize(
  dat,
  alphaSort = FALSE,
  stats = c("mean", "sd", "skewness", "kurtosis", "entropy", "normedEntropy", "nobs",
    "nmiss"),
  probs = c(0, 0.5, 1),
  digits = 3,
  ...
)

Arguments

`dat`	A data frame
`alphaSort`	If TRUE, the columns are re-organized in alphabetical order. If FALSE, they are presented in the original order.
`stats`	A vector of desired summary statistics. Set `stats = NULL` to omit all stat summaries. Legal elements are `c("min", "med", "max", "mean", "sd", "var", "skewness", "kurtosis", "entropy", "normedEntropy", "nobs", "nmiss")`. The statistics `c("entropy", "normedEntropy")` are available only for factor variables, while mean, variance, and so forth will be calculated only for numeric variables. `"nobs"` is the number of observations with non-missing, finite scores (not NA, NaN, -Inf, or Inf). `"nmiss"` is the number of cases with values of NA. The default setting for `probs` will cause `c("min", "med", "max")` to be included, they need not be requested explicitly. To disable them, revise `probs`.
`probs`	For numeric variables, is used with the `quantile` function. The default is `probs = c(0, .50, 1.0)`, which are labeled in output as `c("min", "med", and "max")`. Set `probs = NULL` to prevent these in the output.
`digits`	Decimal values to display, defaults as 2.
`...`	Optional arguments that are passed to `summarizeNumerics` and `summarizeFactors`. For numeric variables, one can specify `na.rm` and `unbiased`. For discrete variables, the key argument is `maxLevels`, which determines the number of levels that will be reported in tables for discrete variables.

Details

The major purpose here is to generate summary data structure that is more useful in subsequent data analysis. The numeric portion of the summaries are a data frame that can be used in plots or other diagnostics.

The term "factors" was used, but "discrete variables" would have been more accurate. The factor summaries will collect all logical, factor, ordered, and character variables.

Other variable types, such as Dates, will be ignored, with a warning.

Value

Return is a list with two objects 1) output from summarizeNumerics: a data frame with variable names on rows and summary stats on columns, 2) output from summarizeFactors: a list with summary information about each discrete variable. The display on-screen is governed by a method print.summarize.

Author(s)

Paul E. Johnson pauljohn@ku.edu

Examples

library(rockchalk)


set.seed(23452345)
N <- 100
x1 <- gl(12, 2, labels = LETTERS[1:12])
x2 <- gl(8, 3, labels = LETTERS[12:24])
x1 <- sample(x = x1, size=N, replace = TRUE)
x2 <- sample(x = x2, size=N, replace = TRUE)
z1 <- rnorm(N)
a1 <- rnorm(N, mean = 1.2, sd = 11.7)
a2 <- rpois(N, lambda = 10 + abs(a1))
a3 <- rgamma(N, 0.5, 4)
b1 <- rnorm(N, mean = 211.3, sd = 0.4)
dat <- data.frame(z1, a1, x2, a2, x1, a3, b1)
summary(dat)

summarize(dat)

summarize(dat, digits = 4)

summarize(dat, stats = c("min", "max", "mean", "sd"),
          probs = c(0.25, 0.75))

summarize(dat, probs = c(0, 0.20, 0.80),
          stats = c("nobs", "mean", "med", "entropy"))

summarize(dat, probs = c(0, 0.20, 0.50),
          stats = c("nobs", "nmiss", "mean", "entropy"), maxLevels=10)

dat.sum <- summarize(dat, probs = c(0, 0.20, 0.50),
                     stats = c("nobs", "nmiss", "mean", "entropy"), maxLevels=10)
dat.sum
## Inspect unformatted structure of objects within return
dat.sum[["numerics"]]
dat.sum[["factors"]]

## Only quantile values, no summary stats for numeric variables
## Discrete variables get entropy
summarize(dat, 
          probs = c(0, 0.25, 0.50, 0.75, 1.0),
          stats = "entropy", digits = 2)

## Quantiles and the mean for numeric variables.
## No diversity stats for discrete variables (entropy omitted)
summarize(dat, 
          probs = c(0, 0.25, 0.50, 0.75, 1.0),
          stats = "mean")

summarize(dat, 
          probs = NULL,
          stats = "mean")

## Note: output is not beautified by a print method
dat.sn <- summarizeNumerics(dat)
dat.sn
formatSummarizedNumerics(dat.sn)
formatSummarizedNumerics(dat.sn, digits = 5)

dat.summ <- summarize(dat)


dat.sf <- summarizeFactors(dat, maxLevels = 20)
dat.sf
formatSummarizedFactors(dat.sf)

## See actual values of factor summaries, without
## beautified printing
summarizeFactors(dat, maxLevels = 5)
formatSummarizedFactors(summarizeFactors(dat, maxLevels = 5))

summarize(dat, alphaSort = TRUE) 

summarize(dat, digits = 6, alphaSort = FALSE)


summarize(dat, maxLevels = 2)

datsumm <- summarize(dat, stats = c("mean", "sd", "var", "entropy", "nobs"))

## Unbeautified numeric data frame, variables on the rows
datsumm[["numerics"]]
## Beautified versions 1. shows saved version:
attr(datsumm, "numeric.formatted")
## 2. Run formatSummarizedNumerics to re-specify digits:
formatSummarizedNumerics(datsumm[["numerics"]], digits = 10)

datsumm[["factors"]]
formatSummarizedFactors(datsumm[["factors"]])
formatSummarizedFactors(datsumm[["factors"]], digits = 6, maxLevels = 10)

[Package rockchalk version 1.8.157 Index]