rmsOverview {rms}R Documentation

Overview of rms Package

Description

rms is the package that goes along with the book Regression Modeling Strategies. rms does regression modeling, testing, estimation, validation, graphics, prediction, and typesetting by storing enhanced model design attributes in the fit. rms is a re-written version of the Design package that has improved graphics and duplicates very little code in the survival package.

The package is a collection of about 180 functions that assist and streamline modeling, especially for biostatistical and epidemiologic applications. It also contains functions for binary and ordinal logistic regression models and the Buckley-James multiple regression model for right-censored responses, and implements penalized maximum likelihood estimation for logistic and ordinary linear models. rms works with almost any regression model, but it was especially written to work with logistic regression, Cox regression, accelerated failure time models, ordinary linear models, the Buckley-James model, generalized lease squares for longitudinal data (using the nlme package), generalized linear models, and quantile regression (using the quantreg package). rms requires the Hmisc package to be installed. Note that Hmisc has several functions useful for data analysis (especially data reduction and imputation).

Older references below pertaining to the Design package are relevant to rms.

Details

To make use of automatic typesetting features you must have LaTeX or one of its variants installed.

Some aspects of rms (e.g., latex) will not work correctly if options(contrasts=) other than c("contr.treatment", "contr.poly") are used.

rms relies on a wealth of survival analysis functions written by Terry Therneau of Mayo Clinic. Front-ends have been written for several of Therneau's functions, and other functions have been slightly modified.

Statistical Methods Implemented

Motivation

rms was motivated by the following needs:

Fitting Functions Compatible with rms

rms will work with a wide variety of fitting functions, but it is meant especially for the following:

Function Purpose Related S
Functions
ols Ordinary least squares linear model lm
lrm Binary and ordinal logistic regression glm
model cr.setup
orm Ordinal regression model lrm
blrm Bayesian binary and ordinal regression \
psm Accelerated failure time parametric survreg
survival model
cph Cox proportional hazards regression coxph
npsurv Nonparametric survival estimates survfit.formula
bj Buckley-James censored least squares survreg
linear model
Glm Version of glm for use with rms glm
Gls Version of gls for use with rms gls
Rq Version of rq for use with rms rq

Methods in rms

The following generic functions work with fits with rms in effect:

Function Purpose Related
Functions
print Print parameters and statistics of fit
coef Fitted regression coefficients
formula Formula used in the fit
specs Detailed specifications of fit
robcov Robust covariance matrix estimates
bootcov Bootstrap covariance matrix estimates
summary Summary of effects of predictors
plot.summary Plot continuously shaded confidence
bars for results of summary
anova Wald tests of most meaningful hypotheses
contrast General contrasts, C.L., tests
plot.anova Depict results of anova graphically dotchart
Predict Partial predictor effects predict
plot.Predict Plot predictor effects using lattice graphics predict
ggplot Similar to above but using ggplot2
plotp Similar to above but using plotly
bplot 3-D plot of effects of varying two
continuous predictors image, persp, contour
gendata Generate data frame with predictor expand.grid
combinations (optionally interactively)
predict Obtain predicted values or design matrix
fastbw Fast backward step-down variable step
selection
residuals Residuals, influence statistics from fit
(or resid)
which.influence Which observations are overly residuals
influential
sensuc Sensitivity of one binary predictor in
lrm and cph models to an unmeasured
binary confounder
latex LaTeX representation of fitted
model or anova or summary table
Function S function analytic representation Function.transcan
of a fitted regression model (X\beta)
hazard S function analytic representation rcspline.restate
of a fitted hazard function (for psm)
Survival S function analytic representation of
fitted survival function (for psm,cph)
Quantile S function analytic representation of
fitted function for quantiles of
survival time (for psm, cph)
nomogram Draws a nomogram for the fitted model latex, plot, ggplot, plotp
survest Estimate survival probabilities survfit
(for psm, cph)
survplot Plot survival curves (psm, cph, npsurv) plot.survfit
validate Validate indexes of model fit using val.prob
resampling
calibrate Estimate calibration curve for model
using resampling
vif Variance inflation factors for a fit
naresid Bring elements corresponding to missing
data back into predictions and residuals
naprint Print summary of missing values
pentrace Find optimum penality for penalized MLE
effective.df Print effective d.f. for each type of
variable in model, for penalized fit or
pentrace result
rm.impute Impute repeated measures data with transcan,
non-random dropout fit.mult.impute
experimental, non-functional

Background for Examples

The following programs demonstrate how the pieces of the rms package work together. A (usually) one-time call to the function datadist requires a pass at the entire data frame to store distribution summaries for potential predictor variables. These summaries contain (by default) the .25 and .75 quantiles of continuous variables (for estimating effects such as odds ratios), the 10th smallest and 10th largest values (or .1 and .9 quantiles for small n) for plotting ranges for estimated curves, and the total range. For discrete numeric variables (those having \leq 10 unique values), the list of unique values is also stored. Such summaries are used by the summary.rms, Predict, and nomogram.rms functions. You may save time and defer running datadist. In that case, the distribution summary is not stored with the fit object, but it can be gathered before running summary, plot, ggplot, or plotp.

d <- datadist(my.data.frame) # or datadist(x1,x2)
options(datadist="d") # omit this or use options(datadist=NULL)
# if not run datadist yet
cf <- ols(y ~ x1 * x2)
anova(f)
fastbw(f)
Predict(f, x2) predict(f, newdata)

In the Examples section there are three detailed examples using a fitting function designed to be used with rms, lrm (logistic regression model). In Detailed Example 1 we create 3 predictor variables and a two binary response on 500 subjects. For the first binary response, dz, the true model involves only sex and age, and there is a nonlinear interaction between the two because the log odds is a truncated linear relationship in age for females and a quadratic function for males. For the second binary outcome, dz.bp, the true population model also involves systolic blood pressure (sys.bp) through a truncated linear relationship. First, nonparametric estimation of relationships is done using the Hmisc package's plsmo function which uses lowess with outlier detection turned off for binary responses. Then parametric modeling is done using restricted cubic splines. This modeling does not assume that we know the true transformations for age or sys.bp but that these transformations are smooth (which is not actually the case in the population).

For Detailed Example 2, suppose that a categorical variable treat has values "a", "b", and "c", an ordinal variable num.diseases has values 0,1,2,3,4, and that there are two continuous variables, age and cholesterol. age is fitted with a restricted cubic spline, while cholesterol is transformed using the transformation log(cholesterol - 10). Cholesterol is missing on three subjects, and we impute these using the overall median cholesterol. We wish to allow for interaction between treat and cholesterol. The following S program will fit a logistic model, test all effects in the design, estimate effects, and plot estimated transformations. The fit for num.diseases really considers the variable to be a 5-level categorical variable. The only difference is that a 3 d.f. test of linearity is done to assess whether the variable can be re-modeled "asis". Here we also show statements to attach the rms package and store predictor characteristics from datadist.

Detailed Example 3 shows some of the survival analysis capabilities of rms related to the Cox proportional hazards model. We simulate data for 2000 subjects with 2 predictors, age and sex. In the true population model, the log hazard function is linear in age and there is no age \times sex interaction. In the analysis below we do not make use of the linearity in age. rms makes use of many of Terry Therneau's survival functions that are builtin to S.

The following is a typical sequence of steps that would be used with rms in conjunction with the Hmisc transcan function to do single imputation of all NAs in the predictors (multiple imputation would be better but would be harder to do in the context of bootstrap model validation), fit a model, do backward stepdown to reduce the number of predictors in the model (with all the severe problems this can entail), and use the bootstrap to validate this stepwise model, repeating the variable selection for each re-sample. Here we take a short cut as the imputation is not repeated within the bootstrap.

In what follows we (atypically) have only 3 candidate predictors. In practice be sure to have the validate and calibrate functions operate on a model fit that contains all predictors that were involved in previous analyses that used the response variable. Here the imputation is necessary because backward stepdown would otherwise delete observations missing on any candidate variable.

Note that you would have to define x1, x2, x3, y to run the following code.

xt <- transcan(~ x1 + x2 + x3, imputed=TRUE)
impute(xt) # imputes any NAs in x1, x2, x3
# Now fit original full model on filled-in data
f <- lrm(y ~ x1 + rcs(x2,4) + x3, x=TRUE, y=TRUE) #x,y allow boot.
fastbw(f)
# derives stepdown model (using default stopping rule)
validate(f, B=100, bw=TRUE) # repeats fastbw 100 times
cal <- calibrate(f, B=100, bw=TRUE) # also repeats fastbw
plot(cal)

Common Problems to Avoid

  1. Don't have a formula like y ~ age + age^2. In S you need to connect related variables using a function which produces a matrix, such as pol or rcs. This allows effect estimates (e.g., hazard ratios) to be computed as well as multiple d.f. tests of association.

  2. Don't use poly or strata inside formulas used in rms. Use pol and strat instead.

  3. Almost never code your own dummy variables or interaction variables in S. Let S do this automatically. Otherwise, anova can't do its job.

  4. Almost never transform predictors outside of the model formula, as then plots of predicted values vs. predictor values, and other displays, would not be made on the original scale. Use instead something like y ~ log(cell.count+1), which will allow cell.count to appear on x-axes. You can get fancier, e.g., y ~ rcs(log(cell.count+1),4) to fit a restricted cubic spline with 4 knots in log(cell.count+1). For more complex transformations do something like f <- function(x) {
    ... various 'if' statements, etc.
    log(pmin(x,50000)+1)
    }
    fit1 <- lrm(death ~ f(cell.count))
    fit2 <- lrm(death ~ rcs(f(cell.count),4))
    }

  5. Don't put $ inside variable names used in formulas. Either attach data frames or use data=.

  6. Don't forget to use datadist. Try to use it at the top of your program so that all model fits can automatically take advantage if its distributional summaries for the predictors.

  7. Don't validate or calibrate models which were reduced by dropping "insignificant" predictors. Proper bootstrap or cross-validation must repeat any variable selection steps for each re-sample. Therefore, validate or calibrate models which contain all candidate predictors, and if you must reduce models, specify the option bw=TRUE to validate or calibrate.

  8. Dropping of "insignificant" predictors ruins much of the usual statistical inference for regression models (confidence limits, standard errors, P-values, \chi^2, ordinary indexes of model performance) and it also results in models which will have worse predictive discrimination.

Accessing the Package

Use require(rms).

Published Applications of rms and Regression Splines

Bug Reports

The author is willing to help with problems. Send E-mail to fh@fharrell.com. To report bugs, please do the following:

  1. If the bug occurs when running a function on a fit object (e.g., anova), attach a dump'd text version of the fit object to your note. If you used datadist but not until after the fit was created, also send the object created by datadist. Example: save(myfit,"/tmp/myfit.rda") will create an R binary save file that can be attached to the E-mail.

  2. If the bug occurs during a model fit (e.g., with lrm, ols, psm, cph), send the statement causing the error with a save'd version of the data frame used in the fit. If this data frame is very large, reduce it to a small subset which still causes the error.

Copyright Notice

GENERAL DISCLAIMER This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. In short: you may use this code any way you like, as long as you don't charge money for it, remove this notice, or hold anyone liable for its results. Also, please acknowledge the source and communicate changes to the author.

If this software is used is work presented for publication, kindly reference it using for example: Harrell FE (2009): rms: S functions for biostatistical/epidemiologic modeling, testing, estimation, validation, graphics, and prediction. Programs available from https://hbiostat.org/R/rms/. Be sure to reference other packages used as well as R itself.

Author(s)

Frank E Harrell Jr
Professor of Biostatistics
Vanderbilt University School of Medicine
Nashville, Tennessee
fh@fharrell.com

References

The primary resource for the rms package is Regression Modeling Strategies, second edition by FE Harrell (Springer-Verlag, 2015) and the web page https://hbiostat.org/R/rms/. See also the Statistics in Medicine articles by Harrell et al listed below for case studies of modeling and model validation using rms.

Several datasets useful for multivariable modeling with rms are found at https://hbiostat.org/data/.

Examples

## To run several comprehensive examples, run the following command
## Not run: 
demo(all, 'rms')

## End(Not run)

[Package rms version 6.8-0 Index]