redun {Hmisc} | R Documentation |
Redundancy Analysis
Description
Uses flexible parametric additive models (see areg and its use of
regression splines), or alternatively runs a regular regression after
replacing continuous variables with ranks, to determine how well each
variable can be predicted from the remaining variables. Variables are
dropped in a stepwise fashion, removing the most predictable variable at
each step. The remaining variables are used as predictors in the next
step. The process continues until no variable still in the list of
predictors can be predicted with an R^2 or adjusted R^2 of at least r2,
or until dropping the variable with the highest R^2 (adjusted or
ordinary) would cause a variable that was dropped earlier to no longer
be predicted at least at the r2 level from the now smaller list of
predictors.
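The stepwise rule described above can be sketched in base R. This is only an illustration under simplifying assumptions: ordinary linear models (lm) stand in for the flexible additive fits, all variables are numeric, and the helpers r2_of and greedy_redundancy are hypothetical names, not part of Hmisc.

```r
# Simplified sketch: lm() replaces the flexible additive models, so this is
# NOT what redun() computes internally; it only mirrors the stopping rules.
r2_of <- function(target, predictors, data) {
  fit <- lm(reformulate(predictors, response = target), data = data)
  summary(fit)$r.squared
}

greedy_redundancy <- function(data, r2 = 0.9) {
  kept    <- names(data)
  dropped <- character(0)
  while (length(kept) > 1) {
    # R^2 for predicting each remaining variable from the other remaining ones
    r2s <- sapply(kept, function(v) r2_of(v, setdiff(kept, v), data))
    if (max(r2s) < r2) break            # nothing left is predictable enough
    candidate <- names(which.max(r2s))  # most predictable variable
    remaining <- setdiff(kept, candidate)
    # Stop if an earlier-dropped variable would fall below r2 after this drop
    still_ok <- all(vapply(dropped,
                           function(v) r2_of(v, remaining, data) >= r2,
                           logical(1)))
    if (!still_ok) break
    kept    <- remaining
    dropped <- c(dropped, candidate)
  }
  list(kept = kept, dropped = dropped)
}

set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n) / 10
x4 <- x1 + x2 + x3 + runif(n) / 10
res <- greedy_redundancy(data.frame(x1, x2, x3, x4), r2 = 0.8)
res
```

Because x3 and x4 are near-linear functions of x1 and x2, this sketch should flag them as redundant; redun itself may order or select variables differently because it uses nonlinear transformations.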
There is also an option qrank to expand each variable into two columns
containing the rank and square of the rank. Whenever ranks are used,
they are computed as fractional ranks for numerical reasons.
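As a rough illustration of the rank expansion, the sketch below assumes that a fractional rank is the ordinary rank divided by the number of non-missing values; the exact definition used inside redun is not stated here, and frac_rank is a hypothetical helper. Keeping ranks in (0, 1] avoids the very large values that squared integer ranks would take in big samples.

```r
# Assumption: fractional rank = rank / number of non-missing observations
frac_rank <- function(x) rank(x, na.last = "keep") / sum(!is.na(x))

x  <- c(10, 200, 3000, 40000, NA)
fr <- frac_rank(x)

# Two-column expansion in the spirit of qrank=TRUE: the rank and its square
expanded <- cbind(rank = fr, rank_sq = fr^2)
expanded
```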
Usage
redun(formula, data=NULL, subset=NULL, r2 = 0.9,
type = c("ordinary", "adjusted"), nk = 3, tlinear = TRUE,
rank=qrank, qrank=FALSE,
allcat=FALSE, minfreq=0, iterms=FALSE, pc=FALSE, pr = FALSE, ...)
## S3 method for class 'redun'
print(x, digits=3, long=TRUE, ...)
Arguments
formula |
a formula. Enclose a variable in |
data |
a data frame, which must be omitted if |
subset |
usual subsetting expression |
r2 |
ordinary or adjusted |
type |
specify |
nk |
number of knots to use for continuous variables. Use
|
tlinear |
set to |
rank |
set to |
qrank |
set to |
allcat |
set to |
minfreq |
For a binary or categorical variable, there must be at
least two categories with at least |
iterms |
set to |
pc |
if |
pr |
set to |
... |
arguments to pass to |
x |
an object created by |
digits |
number of digits to which to round |
long |
set to |
Details
A categorical variable is deemed redundant if a linear combination of
dummy variables representing it can be predicted from a linear
combination of other variables. For example, if there were 4 cities in
the data and each city's rainfall was also present as a variable, with
virtually the same rainfall reported for all observations for a city,
city would be redundant given rainfall (or vice-versa; the one declared
redundant would be the first one in the formula). If two cities had the
same rainfall, city might be declared redundant even though tied cities
might be deemed non-redundant in another setting. To ensure that all
categories may be predicted well from other variables, use the allcat
option. To ignore categories that are too infrequent or too frequent,
set minfreq to a nonzero integer. When the number of observations in the
category is below this number or the number of observations not in the
category is below this number, no attempt is made to predict
observations being in that category individually for the purpose of
redundancy detection.
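The city and rainfall example can be mimicked with simulated data. The sketch below is an assumption-laden illustration: the city labels, rainfall levels, and noise are invented, and simple linear fits stand in for the canonical-variate calculations. It shows rainfall being almost perfectly predicted from city while the indicator for a tied city is poorly predicted from rainfall, which is the situation allcat is meant to catch.

```r
set.seed(2)
city <- factor(sample(c("A", "B", "C", "D"), 200, replace = TRUE))

# Hypothetical rainfall: one level per city plus tiny noise; B and C are tied
level <- c(A = 10, B = 25, C = 25, D = 40)
rain  <- unname(level[as.character(city)]) + rnorm(200, sd = 0.1)

# Rainfall is almost perfectly predicted from the city dummies
r2_rain <- summary(lm(rain ~ city))$r.squared

# Each category indicator predicted from rainfall (linear fit for simplicity)
r2_by_city <- sapply(levels(city), function(k) {
  summary(lm(as.numeric(city == k) ~ rain))$r.squared
})

# Category counts, to compare against a minfreq-style threshold
counts <- table(city)

r2_rain
r2_by_city
counts
```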
Value
an object of class "redun" including an element "scores", a numeric
matrix with all transformed values when each variable was the dependent
variable and the first canonical variate was computed
Author(s)
Frank Harrell
Department of Biostatistics
Vanderbilt University
fh@fharrell.com
See Also
areg, dataframeReduce, transcan, varclus, r2describe, subselect::genetic
Examples
set.seed(1)
n <- 100
x1 <- runif(n)
x2 <- runif(n)
x3 <- x1 + x2 + runif(n)/10
x4 <- x1 + x2 + x3 + runif(n)/10
x5 <- factor(sample(c('a','b','c'),n,replace=TRUE))
x6 <- 1*(x5=='a' | x5=='c')
redun(~x1+x2+x3+x4+x5+x6, r2=.8)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, minfreq=40)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, allcat=TRUE)
# x5 is no longer redundant but x6 is
redun(~x1+x2+x3+x4+x5+x6, r2=.8, rank=TRUE)
redun(~x1+x2+x3+x4+x5+x6, r2=.8, qrank=TRUE)
# To help decode which variables made a particular variable redundant:
# r <- redun(...)
# r2describe(r$scores)