summary.formula {Hmisc}  R Documentation 
Summarize Data for Making Tables and Plots
Description
summary.formula
summarizes the variables listed in an S formula,
computing descriptive statistics (including ones in a
userspecified function). The summary statistics may be passed to
print
methods, plot
methods for making annotated dot charts, and
latex
methods for typesetting tables using LaTeX.
summary.formula
has three methods for computing descriptive
statistics on univariate or multivariate responses, subsetted by
categories of other variables. The method of summarization is
specified in the parameter method
(see details below). For the
response
and cross
methods, the statistics used to
summarize the data
may be specified in a very flexible way (e.g., the geometric mean,
33rd percentile, KaplanMeier 2year survival estimate, mixtures of
several statistics). The default summary statistic for these methods
is the mean (the proportion of positive responses for a binary
response variable). The cross
method is useful for creating data
frames which contain summary statistics that are passed to trellis
as raw data (to make multipanel dot charts, for example). The
print
methods use the print.char.matrix
function to print boxed
tables.
The right hand side of formula
may contain mChoice
(“multiple choice”) variables. When test=TRUE
each choice is
tested separately as a binary categorical response.
The plot
method for method="reverse"
creates a temporary
function Key
in frame 0 as is done by the xYplot
and
Ecdf.formula
functions. After plot
runs, you can type
Key()
to put a legend in a default location, or
e.g. Key(locator(1))
to draw a legend where you click the left
mouse button. This key is for categorical variables, so to have the
opportunity to put the key on the graph you will probably want to use
the command plot(object, which="categorical")
. A second function
Key2
is created if continuous variables are being plotted. It is
used the same as Key
. If the which
argument is not
specified to plot
, two pages of plots will be produced. If you
don't define par(mfrow=)
yourself,
plot.summary.formula.reverse
will try to lay out a multipanel
graph to best fit all the individual dot charts for continuous
variables.
There is a subscripting method for objects created with
method="response"
.
This can be used to print or plot selected variables or summary statistics
where there would otherwise be too many on one page.
cumcategory
is a utility function useful when summarizing an ordinal
response variable. It converts such a variable having k
levels to a
matrix with k1
columns, where column i
is a vector of zeros and
ones indicating that the categorical response is in level i+1
or
greater. When the left hand side of formula
is cumcategory(y)
,
the default fun
will summarize it by computing all of the relevant
cumulative proportions.
Functions conTestkw
, catTestchisq
, ordTestpo
are
the default statistical test functions for summary.formula
.
These defaults are: WilcoxonKruskalWallis test for continuous
variables, Pearson chisquare test for categorical variables, and the
likelihood ratio chisquare test from the proportional odds model for
ordinal variables. These three functions serve also as templates for
the user to create her own testing functions that are selfdefining in
terms of how the results are printed or rendered in LaTeX, or plotted.
Usage
## S3 method for class 'formula'
summary(formula, data=NULL, subset=NULL,
na.action=NULL, fun = NULL,
method = c("response", "reverse", "cross"),
overall = method == "response"  method == "cross",
continuous = 10, na.rm = TRUE, na.include = method != "reverse",
g = 4, quant = c(0.025, 0.05, 0.125, 0.25, 0.375, 0.5, 0.625,
0.75, 0.875, 0.95, 0.975),
nmin = if (method == "reverse") 100
else 0,
test = FALSE, conTest = conTestkw, catTest = catTestchisq,
ordTest = ordTestpo, ...)
## S3 method for class 'summary.formula.response'
x[i, j, drop=FALSE]
## S3 method for class 'summary.formula.response'
print(x, vnames=c('labels','names'), prUnits=TRUE,
abbreviate.dimnames=FALSE,
prefix.width, min.colwidth, formatArgs=NULL, markdown=FALSE, ...)
## S3 method for class 'summary.formula.response'
plot(x, which = 1, vnames = c('labels','names'), xlim, xlab,
pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), superposeStrata = TRUE,
dotfont = 1, add = FALSE, reset.par = TRUE, main, subtitles = TRUE,
...)
## S3 method for class 'summary.formula.response'
latex(object, title = first.word(deparse(substitute(object))), caption,
trios, vnames = c('labels', 'names'), prn = TRUE, prUnits = TRUE,
rowlabel = '', cdec = 2, ncaption = TRUE, ...)
## S3 method for class 'summary.formula.reverse'
print(x, digits, prn = any(n != N), pctdig = 0,
what=c('%', 'proportion'),
npct = c('numerator', 'both', 'denominator', 'none'),
exclude1 = TRUE, vnames = c('labels', 'names'), prUnits = TRUE,
sep = '/', abbreviate.dimnames = FALSE,
prefix.width = max(nchar(lab)), min.colwidth, formatArgs=NULL, round=NULL,
prtest = c('P','stat','df','name'), prmsd = FALSE, long = FALSE,
pdig = 3, eps = 0.001, ...)
## S3 method for class 'summary.formula.reverse'
plot(x, vnames = c('labels', 'names'), what = c('proportion', '%'),
which = c('both', 'categorical', 'continuous'),
xlim = if(what == 'proportion') c(0,1)
else c(0,100),
xlab = if(what=='proportion') 'Proportion'
else 'Percentage',
pch = c(16, 1, 2, 17, 15, 3, 4, 5, 0), exclude1 = TRUE,
dotfont = 1, main,
prtest = c('P', 'stat', 'df', 'name'), pdig = 3, eps = 0.001,
conType = c('dot', 'bp', 'raw'), cex.means = 0.5, ...)
## S3 method for class 'summary.formula.reverse'
latex(object, title = first.word(deparse(substitute(object))), digits,
prn = any(n != N), pctdig = 0, what=c('%', 'proportion'),
npct = c("numerator", "both", "denominator", "slash", "none"),
npct.size = 'scriptsize', Nsize = "scriptsize", exclude1 = TRUE,
vnames=c("labels", "names"), prUnits = TRUE, middle.bold = FALSE,
outer.size = "scriptsize", caption, rowlabel = "",
insert.bottom = TRUE, dcolumn = FALSE, formatArgs=NULL, round = NULL,
prtest = c('P', 'stat', 'df', 'name'), prmsd = FALSE,
msdsize = NULL, long = dotchart, pdig = 3, eps = 0.001,
auxCol = NULL, dotchart=FALSE, ...)
## S3 method for class 'summary.formula.cross'
print(x, twoway = nvar == 2, prnmiss = any(stats$Missing > 0), prn = TRUE,
abbreviate.dimnames = FALSE, prefix.width = max(nchar(v)),
min.colwidth, formatArgs = NULL, ...)
## S3 method for class 'summary.formula.cross'
latex(object, title = first.word(deparse(substitute(object))),
twoway = nvar == 2, prnmiss = TRUE, prn = TRUE,
caption=attr(object, "heading"), vnames=c("labels", "names"),
rowlabel="", ...)
stratify(..., na.group = FALSE, shortlabel = TRUE)
## S3 method for class 'summary.formula.cross'
formula(x, ...)
cumcategory(y)
conTestkw(group, x)
catTestchisq(tab)
ordTestpo(group, x)
Arguments
formula 
An R formula with additive effects. For 
x 
an object created by 
y 
a numeric, character, category, or factor vector for 
drop 
logical. If 
data 
name or number of a data frame. Default is the current frame. 
subset 
a logical vector or integer vector of subscripts used to specify the subset of data to use in the analysis. The default is to use all observations in the data frame. 
na.action 
function for handling missing data in the input data. The default is
a function defined here called 
fun 
function for summarizing data in each cell. Default is to take the
mean of each column of the possibly multivariate response variable.
You can specify 
method 
The default is The The 
overall 
For 
continuous 
specifies the threshold for when a variable is considered to be
continuous (when there are at least 
na.rm 

na.include 
for 
g 
number of quantile groups to use when variables are automatically
categorized with 
nmin 
if fewer than 
test 
applies if 
conTest 
a function of two arguments (grouping variable and a continuous
variable) that returns a list with components 
catTest 
a function of a frequency table (an integer matrix) that returns a
list with the same components as created by 
ordTest 
a function of a frequency table (an integer matrix) that returns a
list with the same components as created by 
... 
for 
object 
an object created by 
quant 
vector of quantiles to use for summarizing data with

vnames 
By default, tables and plots are usually labeled with variable labels
(see the 
pch 
vector of plotting characters to represent different groups, in order
of group levels. For 
superposeStrata 
If 
dotfont 
font for plotting points 
reset.par 
set to 
abbreviate.dimnames 
see 
prefix.width 
see 
min.colwidth 
minimum column width to use for boxes printed with 
formatArgs 
a list containing other arguments to pass to 
markdown 
for 
digits 
number of significant digits to print. Default is to use the current
value of the 
prn 
set to 
prnmiss 
set to 
what 
for 
pctdig 
number of digits to the right of the decimal place for printing percentages. The default is zero, so percents will be rounded to the nearest percent. 
npct 
specifies which counts are to be printed to the right of percentages.
The default is to print the frequency (numerator of the percent) in
parentheses. You can specify 
npct.size 
the size for typesetting 
Nsize 
When a second row of column headings is added showing sample sizes,

exclude1 
by default, 
prUnits 
set to 
sep 
character to use to separate quantiles when printing

prtest 
a vector of test statistic components to print if 
round 
for 
prmsd 
set to 
msdsize 
defaults to 
long 
set to 
pdig 
number of digits to the right of the decimal place for printing
Pvalues. Default is 
eps 
Pvalues less than 
auxCol 
an optional auxiliary column of information, right justified, to add
in front of statistics typeset by

twoway 
for 
which 
For 
conType 
For plotting 
cex.means 
character size for means in boxpercentile plots; default is .5 
xlim 
vector of length two specifying xaxis limits. For

xlab 
xaxis label 
add 
set to 
main 
a main title. For 
subtitles 
set to 
caption 
character string containing LaTeX table captions. 
title 
name of resulting LaTeX file omitting the 
trios 
If for 
rowlabel 
see 
cdec 
number of decimal places to the right of the decimal point for

ncaption 
set to 
i 
a vector of integers, or character strings containing variable names
to subset on. Note that each row subsetted on in an 
j 
a vector of integers representing column numbers 
middle.bold 
set to 
outer.size 
the font size for outer quantiles for 
insert.bottom 
set to 
dcolumn 
see 
na.group 
set to 
shortlabel 
set to 
dotchart 
set to 
group 
for 
tab 
for 
Value
summary.formula
returns a data frame or list depending on
method
. plot.summary.formula.reverse
returns the number
of pages of plots that were made.
Side Effects
plot.summary.formula.reverse
creates a function Key
and
Key2
in frame 0 that will draw legends.
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
References
Harrell FE (2007): Statistical tables and plots using S and LaTeX. Document available from https://hbiostat.org/R/Hmisc/summary.pdf.
See Also
mChoice
, smean.sd
, summarize
,
label
, strata
, dotchart2
,
print.char.matrix
, update
,
formula
, cut2
, llist
,
format.default
, latex
,
latexTranslate
bpplt
,
summaryM
, summary
Examples
options(digits=3)
set.seed(173)
sex < factor(sample(c("m","f"), 500, rep=TRUE))
age < rnorm(500, 50, 5)
treatment < factor(sample(c("Drug","Placebo"), 500, rep=TRUE))
# Generate a 3choice variable; each of 3 variables has 5 possible levels
symp < c('Headache','Stomach Ache','Hangnail',
'Muscle Ache','Depressed')
symptom1 < sample(symp, 500,TRUE)
symptom2 < sample(symp, 500,TRUE)
symptom3 < sample(symp, 500,TRUE)
Symptoms < mChoice(symptom1, symptom2, symptom3, label='Primary Symptoms')
table(Symptoms)
# Note: In this example, some subjects have the same symptom checked
# multiple times; in practice these redundant selections would be NAs
# mChoice will ignore these redundant selections
#Frequency table sex*treatment, sex*Symptoms
summary(sex ~ treatment + Symptoms, fun=table)
# could also do summary(sex ~ treatment +
# mChoice(symptom1,symptom2,symptom3), fun=table)
#Compute mean age, separately by 3 variables
summary(age ~ sex + treatment + Symptoms)
f < summary(treatment ~ age + sex + Symptoms, method="reverse", test=TRUE)
f
# trio of numbers represent 25th, 50th, 75th percentile
print(f, long=TRUE)
plot(f)
plot(f, conType='bp', prtest='P')
bpplt() # annotated example showing layout of bp plot
#Compute predicted probability from a logistic regression model
#For different stratifications compute receiver operating
#characteristic curve areas (Cindexes)
predicted < plogis(.4*(sex=="m")+.15*(age50))
positive.diagnosis < ifelse(runif(500)<=predicted, 1, 0)
roc < function(z) {
x < z[,1];
y < z[,2];
n < length(x);
if(n<2)return(c(ROC=NA));
n1 < sum(y==1);
c(ROC= (mean(rank(x)[y==1])(n1+1)/2)/(nn1) );
}
y < cbind(predicted, positive.diagnosis)
options(digits=2)
summary(y ~ age + sex, fun=roc)
options(digits=3)
summary(y ~ age + sex, fun=roc, method="cross")
#Use stratify() to produce a table in which time intervals go down the
#page and going across 3 continuous variables are summarized using
#quartiles, and are stratified by two treatments
set.seed(1)
d < expand.grid(visit=1:5, treat=c('A','B'), reps=1:100)
d$sysbp < rnorm(100*5*2, 120, 10)
label(d$sysbp) < 'Systolic BP'
d$diasbp < rnorm(100*5*2, 80, 7)
d$diasbp[1] < NA
d$age < rnorm(100*5*2, 50, 12)
g < function(y) {
N < apply(y, 2, function(w) sum(!is.na(w)))
h < function(x) {
qu < quantile(x, c(.25,.5,.75), na.rm=TRUE)
names(qu) < c('Q1','Q2','Q3')
c(N=sum(!is.na(x)), qu)
}
w < as.vector(apply(y, 2, h))
names(w) < as.vector( outer(c('N','Q1','Q2','Q3'), dimnames(y)[[2]],
function(x,y) paste(y,x)))
w
}
#Use na.rm=FALSE to count NAs separately by column
s < summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat),
na.rm=FALSE, fun=g, data=d)
#The result is very wide. Redo, putting treatment vertically
x < with(d, factor(paste('Visit', visit, treat)))
summary(cbind(age,sysbp,diasbp) ~ x, na.rm=FALSE, fun=g, data=d)
#Compose LaTeX code directly
g < function(y) {
h < function(x) {
qu < format(round(quantile(x, c(.25,.5,.75), na.rm=TRUE),1),nsmall=1)
paste('{\\scriptsize(',sum(!is.na(x)),
')} \\hfill{\\scriptsize ', qu[1], '} \\textbf{', qu[2],
'} {\\scriptsize ', qu[3],'}', sep='')
}
apply(y, 2, h)
}
s < summary(cbind(age,sysbp,diasbp) ~ visit + stratify(treat),
na.rm=FALSE, fun=g, data=d)
# latex(s, prn=FALSE)
## need option in latex to not print n
#Put treatment vertically
s < summary(cbind(age,sysbp,diasbp) ~ x, fun=g, data=d, na.rm=FALSE)
# latex(s, prn=FALSE)
#Plot estimated mean life length (assuming an exponential distribution)
#separately by levels of 4 other variables. Repeat the analysis
#by levels of a stratification variable, drug. Automatically break
#continuous variables into tertiles.
#We are using the default, method='response'
## Not run:
life.expect < function(y) c(Years=sum(y[,1])/sum(y[,2]))
attach(pbc)
require(survival)
S < Surv(follow.up.time, death)
s2 < summary(S ~ age + albumin + ascites + edema + stratify(drug),
fun=life.expect, g=3)
#Note: You can summarize other response variables using the same
#independent variables using e.g. update(s2, response~.), or you
#can change the list of independent variables using e.g.
#update(s2, response ~. ascites) or update(s2, .~.ascites)
#You can also print, typeset, or plot subsets of s2, e.g.
#plot(s2[c('age','albumin'),]) or plot(s2[1:2,])
s2 # invokes print.summary.formula.response
#Plot results as a separate dot chart for each of the 3 strata levels
par(mfrow=c(2,2))
plot(s2, cex.labels=.6, xlim=c(0,40), superposeStrata=FALSE)
#Typeset table, creating s2.tex
w < latex(s2, cdec=1)
#Typeset table but just print LaTeX code
latex(s2, file="") # useful for Sweave
#Take control of groups used for age. Compute 3 quartiles for
#both cholesterol and bilirubin (excluding observations that are missing
#on EITHER ONE)
age.groups < cut2(age, c(45,60))
g < function(y) apply(y, 2, quantile, c(.25,.5,.75))
y < cbind(Chol=chol,Bili=bili)
label(y) < 'Cholesterol and Bilirubin'
#You can give new column names that are not legal S names
#by enclosing them in quotes, e.g. 'Chol (mg/dl)'=chol
s < summary(y ~ age.groups + ascites, fun=g)
par(mfrow=c(1,2), oma=c(3,0,3,0)) # allow outer margins for overall
for(ivar in 1:2) { # title
isub < (1:3)+(ivar1)*3 # *3=number of quantiles/var.
plot(s3, which=isub, main='',
xlab=c('Cholesterol','Bilirubin')[ivar],
pch=c(91,16,93)) # [, closed circle, ]
}
mtext(paste('Quartiles of', label(y)), adj=.5, outer=TRUE, cex=1.75)
#Overall (outer) title
prlatex(latex(s3, trios=TRUE))
# trios > collapse 3 quartiles
#Summarize only bilirubin, but do it with two statistics:
#the mean and the median. Make separate tables for the two randomized
#groups and make plots for the active arm.
g < function(y) c(Mean=mean(y), Median=median(y))
for(sub in c("Dpenicillamine", "placebo")) {
ss < summary(bili ~ age.groups + ascites + chol, fun=g,
subset=drug==sub)
cat('\n',sub,'\n\n')
print(ss)
if(sub=='Dpenicillamine') {
par(mfrow=c(1,1))
plot(s4, which=1:2, dotfont=c(1,1), subtitles=FALSE, main='')
#1=mean, 2=median 1 font = open circle
title(sub='Closed circle: mean; Open circle: median', adj=0)
title(sub=sub, adj=1)
}
w < latex(ss, append=TRUE, fi='my.tex',
label=if(sub=='placebo') 's4b' else 's4a',
caption=paste(label(bili),' {\\em (',sub,')}', sep=''))
#Note symbolic labels for tables for two subsets: s4a, s4b
prlatex(w)
}
#Now consider examples in 'reverse' format, where the lone dependent
#variable tells the summary function how to stratify all the
#'independent' variables. This is typically used to make tables
#comparing baseline variables by treatment group, for example.
s5 < summary(drug ~ bili + albumin + stage + protime + sex +
age + spiders,
method='reverse')
#To summarize all variables, use summary(drug ~., data=pbc)
#To summarize all variables with no stratification, use
#summary(~a+b+c) or summary(~.,data=\dots)
options(digits=1)
print(s5, npct='both')
#npct='both' : print both numerators and denominators
plot(s5, which='categorical')
Key(locator(1)) # draw legend at mouse click
par(oma=c(3,0,0,0)) # leave outer margin at bottom
plot(s5, which='continuous')
Key2() # draw legend at lower left corner of plot
# oma= above makes this default key fit the page better
options(digits=3)
w < latex(s5, npct='both', here=TRUE)
# creates s5.tex
#Turn to a different dataset and do crossclassifications on possibly
#more than one independent variable. The summary function with
#method='cross' produces a data frame containing the cross
#classifications. This data frame is suitable for multipanel
#trellis displays, although `summarize' works better for that.
attach(prostate)
size.quartile < cut2(sz, g=4)
bone < factor(bm,labels=c("no mets","bone mets"))
s7 < summary(ap>1 ~ size.quartile + bone, method='cross')
#In this case, quartiles are the default so could have said sz + bone
options(digits=3)
print(s7, twoway=FALSE)
s7 # same as print(s7)
w < latex(s7, here=TRUE) # Make s7.tex
library(trellis,TRUE)
invisible(ps.options(reset=TRUE))
trellis.device(postscript, file='demo2.ps')
dotplot(S ~ size.quartilebone, data=s7, #s7 is name of summary stats
xlab="Fraction ap>1", ylab="Quartile of Tumor Size")
#Can do this more quickly with summarize:
# s7 < summarize(ap>1, llist(size=cut2(sz, g=4), bone), mean,
# stat.name='Proportion')
# dotplot(Proportion ~ size  bone, data=s7)
summary(age ~ stage, method='cross')
summary(age ~ stage, fun=quantile, method='cross')
summary(age ~ stage, fun=smean.sd, method='cross')
summary(age ~ stage, fun=smedian.hilow, method='cross')
summary(age ~ stage, fun=function(x) c(Mean=mean(x), Median=median(x)),
method='cross')
#The next statements print real twoway tables
summary(cbind(age,ap) ~ stage + bone,
fun=function(y) apply(y, 2, quantile, c(.25,.75)),
method='cross')
options(digits=2)
summary(log(ap) ~ sz + bone,
fun=function(y) c(Mean=mean(y), quantile(y)),
method='cross')
#Summarize an ordered categorical response by all of the needed
#cumulative proportions
summary(cumcategory(disease.severity) ~ age + sex)
## End(Not run)