summarize {LMMstar} | R Documentation |
Compute summary statistics
Description
Compute summary statistics for multiple variables and/or multiple groups and save them in a data frame.
Usage
summarize(
formula,
data,
na.action = stats::na.pass,
na.rm = FALSE,
level = 0.95,
columns = c("observed", "missing", "pc.missing", "mean", "sd", "min", "q1", "median",
"q3", "max", "correlation"),
FUN = NULL,
skip.reference = TRUE,
digits = NULL,
filter = NULL,
...
)
Arguments
formula |
[formula] with the outcome(s) on the left-hand side and the grouping variables on the right-hand side.
E.g. Y1+Y2 ~ Gender + Gene will compute the summary statistics for Y1 and for Y2, for each gender and gene.
Passed to the |
data |
[data.frame] dataset containing the observations. |
na.action |
[function] a function which indicates what should happen when the data contain 'NA' values.
Passed to the |
na.rm |
[logical] should the summary statistics be computed after omitting the missing values? |
level |
[numeric,0-1] the confidence level of the confidence intervals. |
columns |
[character vector] names of the summary statistics to keep in the output. Can be any of, or a combination of:
|
FUN |
[function] user-defined function for computing summary statistics. It should take a vector as an argument and output a named single value or a named vector. |
skip.reference |
[logical] should the summary statistics for the reference level of categorical variables be omitted? |
digits |
[integer, >=0] the minimum number of significant digits to be used to display the results. Passed to |
filter |
[character] a regular expression passed to |
... |
additional arguments passed to argument |
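As an illustration of the FUN argument described above, here is a minimal sketch of a user-defined function that takes a vector and returns a named vector. The name iqr_range and the statistics it returns are hypothetical choices made for this example; they are not part of LMMstar.

```r
## Hypothetical user-defined summary: takes a numeric vector and
## returns a named vector, as required by the FUN argument.
iqr_range <- function(x, ...) {
  x <- x[!is.na(x)]                    # drop missing values explicitly
  c(iqr = stats::IQR(x),               # interquartile range
    range = diff(range(x)))            # max minus min
}

res <- iqr_range(c(1, 2, 3, 4, NA))
res
```

Such a function could then be supplied in a call of the form summarize(Y1 ~ 1, data = d, FUN = iqr_range), following the pattern used in the Examples section.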
Details
This function is essentially an interface to the stats::aggregate function.
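To make the relationship with stats::aggregate concrete, the sketch below computes per-group means and standard deviations for two outcomes using base R only. The data frame df and the helper mean_sd are made up for illustration; this is not the internal code of summarize, only the kind of computation it wraps.

```r
## Simulated data: two outcomes and one grouping variable (illustrative only)
set.seed(1)
df <- data.frame(
  Y1 = rnorm(20), Y2 = rnorm(20),
  Gender = rep(c("F", "M"), each = 10)
)

## Helper returning a named vector of summary statistics
mean_sd <- function(x) c(mean = mean(x), sd = stats::sd(x))

## One row per group; each outcome becomes a matrix column
## with sub-columns "mean" and "sd"
agg <- stats::aggregate(cbind(Y1, Y2) ~ Gender, data = df, FUN = mean_sd)
agg$Y1
```

The raw output of stats::aggregate stores each outcome as a matrix column, whereas summarize returns an ordinary data frame with one row per outcome and group.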
WARNING: it has the same name as a function from the dplyr package. If you have loaded dplyr already, you should use ::: to call summarize, i.e. use LMMstar:::summarize.
Confidence intervals (CI) and prediction intervals (PI) for the mean are computed via stats::t.test.
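For reference, the confidence interval returned by stats::t.test for a mean can be reproduced by hand from the Student t quantile. The sketch below uses a made-up vector x and covers only the confidence interval; it does not show how summarize derives the prediction interval.

```r
## Illustrative data (not from the package)
x <- c(4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9)
level <- 0.95

## Confidence interval for the mean as reported by t.test
ci <- stats::t.test(x, conf.level = level)$conf.int

## Same interval computed manually: mean +/- t quantile * standard error
n <- length(x)
half <- stats::qt(1 - (1 - level) / 2, df = n - 1) * stats::sd(x) / sqrt(n)
manual <- mean(x) + c(-1, 1) * half

all.equal(as.numeric(ci), manual)
```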
Confidence intervals (CI) for the median are computed via asht::medianTest.
Correlation can be assessed when a grouping and an ordering variable are given in the formula interface, e.g. Y ~ time|id.
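To illustrate what a correlation across the levels of an ordering variable within a grouping variable means, the sketch below reshapes simulated long-format data to wide format and calls stats::cor. The data and variable names are assumptions made for this example, and the approach is not necessarily the one used inside summarize.

```r
## Simulated long-format data: 50 subjects (id), 3 time points each
set.seed(2)
n <- 50
id <- rep(1:n, each = 3)
time <- rep(1:3, times = n)
u <- rnorm(n)                                  # subject-specific effect
dL <- data.frame(id = id, time = time, Y = u[id] + rnorm(3 * n))

## One row per id, one column per time point (Y.1, Y.2, Y.3)
dW <- reshape(dL, idvar = "id", timevar = "time", direction = "wide")

## Correlation matrix of the outcome between time points
R <- stats::cor(dW[, c("Y.1", "Y.2", "Y.3")], use = "pairwise.complete.obs")
R
```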
Value
A data frame containing summary statistics (in columns) for each outcome and value of the grouping variables (rows). It has an attribute "correlation" when it was possible to compute the correlation matrix for each outcome with respect to the grouping variable.
Examples
#### simulate data (wide format) ####
set.seed(10)
d <- sampleRem(1e2, n.times = 3)
d$treat <- sample(LETTERS[1:3], NROW(d), replace=TRUE, prob=c(0.3, 0.3, 0.4) )
## add a missing value
d2 <- d
d2[1,"Y2"] <- NA
#### summarize (wide format) ####
summarize(Y1 ~ 1, data = d)
summarize(Y1 ~ 1, data = d, FUN = quantile, p = c(0.25,0.75))
summarize(Y1+Y2 ~ X1, data = d)
summarize(treat ~ 1, data = d)
summarize(treat ~ 1, skip.reference = FALSE, data = d)
summarize(Y1 ~ X1, data = d2)
summarize(Y1+Y2 ~ X1, data = d2, na.rm = TRUE)
summarize(. ~ treat, data = d2, na.rm = TRUE, filter = "Y")
#### summarize (long format) ####
dL <- reshape(d, idvar = "id", direction = "long",
v.names = "Y", varying = c("Y1","Y2","Y3"))
summarize(Y ~ time + X1, data = dL)
## compute correlations (single time variable)
e.S <- summarize(Y ~ time + X1 | id, data = dL, na.rm = TRUE)
e.S
attr(e.S, "correlation")
## compute correlations (composite time variable)
dL$time2 <- dL$time == 2
dL$time3 <- dL$time == 3
e.S <- summarize(Y ~ time2 + time3 + X1 | id, data = dL, na.rm = TRUE)
e.S
attr(e.S, "correlation")