varclus {Hmisc} | R Documentation |

## Variable Clustering

### Description

Does a hierarchical cluster analysis on variables, using the Hoeffding
D statistic, squared Pearson or Spearman correlations, or proportion
of observations for which two variables are both positive as similarity
measures. Variable clustering is used for assessing collinearity,
redundancy, and for separating variables into clusters that can be
scored as a single variable, thus resulting in data reduction. For
computing any of the three similarity measures, pairwise deletion of
NAs is done. The clustering is done by `hclust()`. A small function
`naclus` is also provided which depicts similarities in which
observations are missing for variables in a data frame. The
similarity measure is the fraction of `NAs` in common between any two
variables. The diagonals of this `sim` matrix are the fraction of NAs
in each variable by itself. `naclus` also computes `na.per.obs`, the
number of missing variables in each observation, and `mean.na`, a
vector whose ith element is the mean number of missing variables other
than variable i, for observations in which variable i is missing. The
`naplot` function makes several plots (see the `which` argument).
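
The quantities attributed to `naclus` above can be reproduced by hand in base R. The following is a minimal sketch under stated assumptions, not the package's implementation: the data frame `df` and its values are made up for illustration, and no clustering is performed.

```r
# Hypothetical toy data frame with some missing values
df <- data.frame(a = c(1, NA, 3, NA),
                 b = c(NA, NA, 3, 4),
                 c = c(1, 2, 3, NA))

miss <- 1 * is.na(df)                 # 0/1 indicator matrix of missingness

# Fraction of observations in which both variables are missing;
# the diagonal is the fraction missing for each variable by itself
sim <- crossprod(miss) / nrow(miss)

# Number of missing variables in each observation
na.per.obs <- rowSums(miss)

# For each variable i: mean number of *other* missing variables,
# taken over the observations in which variable i is missing
mean.na <- sapply(colnames(miss), function(v) {
  rows <- miss[, v] == 1
  if (any(rows)) mean(na.per.obs[rows]) - 1 else NA_real_
})

sim
na.per.obs
mean.na
```

Here `sim["a", "b"]` is 0.25 because only one of the four observations is missing both `a` and `b`.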

So as to not generate too many dummy variables for multi-valued
character or categorical predictors, `varclus` will automatically
combine infrequent cells of such variables using `combine.levels`.

`plotMultSim` plots multiple similarity matrices, with the similarity
measure being on the x-axis of each subplot.

`na.pattern` prints a frequency table of all combinations of
missingness for multiple variables. If there are 3 variables, a
frequency table entry labeled `110` corresponds to the number of
observations for which the first and second variables were missing but
the third variable was not missing.
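
To make the `110` labeling concrete, the same kind of table can be built in base R. This is an illustrative sketch only; the data frame `df` is hypothetical and the code is not taken from `na.pattern` itself.

```r
# Hypothetical data: x and y have missing values, z is complete
df <- data.frame(x = c(NA, NA, 1, 2, NA),
                 y = c(NA, NA, 3, NA, 5),
                 z = c(1, 2, 3, 4, 5))

# One string per observation: "1" where a variable is missing, "0" otherwise
pattern <- apply(1 * is.na(df), 1, paste, collapse = "")

# Frequency of each missingness combination
freq <- table(pattern)
freq
```

The entry labeled `110` counts the two observations in which `x` and `y` are missing but `z` is not.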

### Usage

```
varclus(x, similarity=c("spearman","pearson","hoeffding","bothpos","ccbothpos"),
type=c("data.matrix","similarity.matrix"),
method="complete",
data=NULL, subset=NULL, na.action=na.retain,
trans=c("square", "abs", "none"), ...)
## S3 method for class 'varclus'
print(x, abbrev=FALSE, ...)
## S3 method for class 'varclus'
plot(x, ylab, abbrev=FALSE, legend.=FALSE, loc, maxlen, labels, ...)
naclus(df, method)
naplot(obj, which=c('all','na per var','na per obs','mean na',
'na per var vs mean na'), ...)
plotMultSim(s, x=1:dim(s)[3],
slim=range(pretty(c(0,max(s,na.rm=TRUE)))),
slimds=FALSE,
add=FALSE, lty=par('lty'), col=par('col'),
lwd=par('lwd'), vname=NULL, h=.5, w=.75, u=.05,
labelx=TRUE, xspace=.35)
na.pattern(x)
```

### Arguments

`x` |
a formula,
a numeric matrix of predictors, or a similarity matrix. If For |

`df` |
a data frame |

`s` |
an array of similarity matrices. The third dimension of this array
corresponds to different computations of similarities. The first two
dimensions come from a single similarity matrix. This is useful for
displaying similarity matrices computed by |

`similarity` |
the default is to use squared Spearman correlation coefficients, which
will detect monotonic but nonlinear relationships. You can also
specify linear correlation or Hoeffding's (1948) D statistic, which
has the advantage of being sensitive to many types
of dependence, including highly non-monotonic relationships. For
binary data, or data to be made binary, |

`type` |
if |

`method` |
see |

`data` |
a data frame, data table, or list |

`subset` |
a standard subsetting expression |

`na.action` |
These may be specified if |

`trans` |
By default, when the similarity measure is based on
Pearson's or Spearman's correlation coefficients, the coefficients are
squared. Specify |

`...` |
for |

`ylab` |
y-axis label. Default is constructed on the basis of |

`legend.` |
set to |

`loc` |
a list with elements |

`maxlen` |
if a legend is plotted describing abbreviations, original labels
longer than |

`labels` |
a vector of character strings containing labels corresponding to columns in the similarity matrix, if the column names of that matrix are not to be used |

`obj` |
an object created by |

`which` |
defaults to |

`abbrev` |
set to |

`slim` |
2-vector specifying the range of similarity values for scaling the
y-axes. By default this is the observed range over all of |

`slimds` |
set to |

`add` |
set to |

`lty` , `col` , `lwd` |
line type, color, or line thickness for |

`vname` |
optional vector of variable names, in order, used in |

`h` |
relative height for subplot |

`w` |
relative width for subplot |

`u` |
relative extra height and width to leave unused inside the subplot. Also used as the space between y-axis tick mark labels and graph border. |

`labelx` |
set to |

`xspace` |
amount of space, on a scale of 1: |

### Details

`options(contrasts= c("contr.treatment", "contr.poly"))` is issued
temporarily by `varclus` to make sure that ordinary dummy variables
are generated for `factor` variables. Pass arguments to the
`dataframeReduce` function to remove problematic variables
(especially if analyzing all variables in a data frame).
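
As an illustration of what "ordinary dummy variables" look like under treatment contrasts, the sketch below uses base R's `model.matrix` (listed under See Also). It assumes a formula is expanded in that manner; the data frame `d` and its variables are hypothetical.

```r
# Hypothetical data with one numeric variable and one 3-level factor
d <- data.frame(age = c(30, 40, 50, 60),
                country = factor(c("US", "UK", "FR", "US")))

# Set treatment contrasts temporarily, then restore the previous setting
old <- options(contrasts = c("contr.treatment", "contr.poly"))
with.int <- model.matrix(~ age + country, data = d)      # k - 1 dummies
no.int   <- model.matrix(~ age + country - 1, data = d)  # k dummies
options(old)

colnames(with.int)
colnames(no.int)
```

Removing the intercept with `- 1` yields one indicator column per factor level, which is why the formula examples on this page use it.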

### Value

For `varclus` or `naclus`, a list of class `varclus` with elements
`call` (containing the calling statement), `sim` (similarity matrix),
`n` (sample size used if `x` was not a correlation matrix already;
`n` is a matrix), `hclust`, the object created by `hclust`,
`similarity`, and `method`. `naclus` also returns the
two vectors listed under Description, and `naplot` returns an invisible
vector that is the frequency table of the number of missing variables
per observation. `plotMultSim` invisibly returns the limits of
similarities used in constructing the y-axes of each subplot. For
`similarity="ccbothpos"` the `hclust` object is `NULL`.

`na.pattern` creates an integer vector of frequencies.
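
The relationship between a similarity matrix and an `hclust` object can be sketched in base R. This is a rough approximation under stated assumptions, not the function's actual code: it assumes squared Spearman correlations as the similarity and `1 - similarity` as the dissimilarity passed to `hclust`.

```r
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x  <- cbind(x1, x2, x3, x4)

# Squared Spearman correlations, with pairwise deletion of NAs
sim <- cor(x, method = "spearman", use = "pairwise.complete.obs")^2

# Assumed conversion: dissimilarity = 1 - similarity
hc <- hclust(as.dist(1 - sim), method = "complete")

round(sim, 2)
plot(hc)
```

The `sim` computed here plays the role of the `sim` element described above, and `hc` the role of the `hclust` element.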

### Side Effects

plots

### Author(s)

Frank Harrell

Department of Biostatistics, Vanderbilt University

fh@fharrell.com

### References

Sarle, WS: The VARCLUS Procedure. SAS/STAT User's Guide, 4th Edition, 1990. Cary NC: SAS Institute, Inc.

Hoeffding W. (1948): A non-parametric test of independence. Ann Math Stat 19:546–57.

### See Also

`hclust`, `plclust`, `hoeffd`, `rcorr`, `cor`, `model.matrix`,
`locator`, `na.pattern`, `cut2`, `combine.levels`

### Examples

```
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)
x3 <- x1 + x2 + rnorm(200)
x4 <- x2 + rnorm(200)
x <- cbind(x1,x2,x3,x4)
v <- varclus(x, similarity="spear") # spearman is the default anyway
v # invokes print.varclus
print(round(v$sim,2))
plot(v)
# plot(varclus(~ age + sys.bp + dias.bp + country - 1), abbrev=TRUE)
# the -1 causes k dummies to be generated for k countries
# plot(varclus(~ age + factor(disease.code) - 1))
#
#
# use varclus(~., data= fracmiss= maxlevels= minprev=) to analyze all
# "useful" variables - see dataframeReduce for details about arguments
df <- data.frame(a=c(1,2,3),b=c(1,2,3),c=c(1,2,NA),d=c(1,NA,3),
e=c(1,NA,3),f=c(NA,NA,NA),g=c(NA,2,3),h=c(NA,NA,3))
par(mfrow=c(2,2))
for(m in c("ward","complete","median")) {
  plot(naclus(df, method=m))
  title(m)
}
naplot(naclus(df))
n <- naclus(df)
plot(n); naplot(n)
na.pattern(df)
# plotMultSim example: Plot proportion of observations
# for which two variables are both positive (diagonals
# show the proportion of observations for which the
# one variable is positive). Chance-correct the
# off-diagonals by subtracting the product of the
# marginal proportions. On each subplot the x-axis
# shows month (0, 4, 8, 12) and there is a separate
# curve for females and males
d <- data.frame(sex=sample(c('female','male'),1000,TRUE),
                month=sample(c(0,4,8,12),1000,TRUE),
                x1=sample(0:1,1000,TRUE),
                x2=sample(0:1,1000,TRUE),
                x3=sample(0:1,1000,TRUE))
s <- array(NA, c(3,3,4))
opar <- par(mar=c(0,0,4.1,0)) # waste less space
for(sx in c('female','male')) {
  for(i in 1:4) {
    mon <- (i-1)*4
    s[,,i] <- varclus(~x1 + x2 + x3, sim='ccbothpos', data=d,
                      subset=d$month==mon & d$sex==sx)$sim
  }
  plotMultSim(s, c(0,4,8,12), vname=c('x1','x2','x3'),
              add=sx=='male', slimds=TRUE,
              lty=1+(sx=='male'))
  # slimds=TRUE causes separate scaling for diagonals and
  # off-diagonals
}
par(opar)
```
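
The chance-corrected "both positive" similarity described in the comments of the last example can be worked through by hand. The sketch below is an illustration in base R under stated assumptions: the matrix `x` is a hypothetical toy data set with no missing values, and the code follows the comment's description (product of marginal proportions subtracted from the off-diagonals, marginal proportions on the diagonal) and is not taken from the package.

```r
# Hypothetical 0/1 data, four observations on three variables
x <- cbind(x1 = c(1, 1, 0, 0),
           x2 = c(1, 0, 1, 0),
           x3 = c(1, 1, 1, 0))

pos  <- 1 * (x > 0)                  # indicator of a positive value
both <- crossprod(pos) / nrow(pos)   # proportion with both variables positive
marg <- colMeans(pos)                # proportion positive for each variable

# Chance-correct the off-diagonals; keep marginal proportions on the diagonal
cc <- both - outer(marg, marg)
diag(cc) <- marg
cc
```

For example, `x1` and `x3` are both positive in half of the observations, while independence would give 0.5 * 0.75 = 0.375, so the corrected value is 0.125.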

*Hmisc* version 5.1-3