corclust {klaR}

R Documentation

Function to identify groups of highly correlated variables for removing correlated features from the data for further analysis.

Description

A hierarchical clustering of variables is performed with hclust, using 1 - the absolute correlation as the distance measure between two variables.
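
The idea can be sketched in base R, independently of the package. This is a minimal illustration on the built-in `iris` data and is not the package's actual implementation; the object names `num`, `kor`, `d` and `hc` are chosen here for illustration only.

```r
# Minimal base-R sketch of the distance used for numeric variables
# (illustrative only; not the klaR implementation).
num <- iris[, 1:4]                      # numeric columns only

# Pairwise correlations between the variables (columns)
kor <- cor(num, use = "pairwise.complete.obs")

# Distance between two variables: 1 - |correlation|
d <- as.dist(1 - abs(kor))

# Hierarchical clustering of the variables
hc <- hclust(d, method = "complete")
print(hc)
```

Calling `plot(hc)` on the result draws the dendrogram of variables.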

Usage

corclust(x, cl = NULL, method = "complete")
## S3 method for class 'corclust'
plot(x, selection = "both", mincor = NULL, ...)

Arguments

`x`	Either a data frame or a matrix consisting of numerical attributes.
`cl`	Optional vector of type factor indicating class levels, if class-specific correlations should be considered.
`method`	Linkage to be used for clustering. Default is `complete` linkage.
`selection`	If `"numeric"`, ‘1 - average absolute correlation within cluster’ is plotted, if `"factor"`, ‘1 - minimum Cramer's V within cluster’ is plotted. The default, `"both"`, generates both variations.
`mincor`	Adds a horizontal line for this correlation.
`...`	Passed to the underlying plot functions.

Details

Each cluster consists of a set of correlated variables according to the chosen clustering criterion. The default criterion is ‘complete’ linkage. This choice is meaningful because the height at which a cluster is formed then corresponds to 1 minus the minimum absolute correlation between any two variables of that cluster.
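
The following base-R sketch illustrates this property of complete linkage: cutting the tree at height 1 - r gives clusters whose pairwise absolute correlations are all at least r. It does not call the package; the threshold `r` and the names `groups` and `min_within` are assumptions made for the illustration.

```r
# Sketch: with complete linkage, cutting the tree at height 1 - r
# yields clusters whose pairwise absolute correlations are all >= r.
# (Base R only; `r`, `groups` and `min_within` are illustrative names.)
kor <- abs(cor(iris[, 1:4]))
hc  <- hclust(as.dist(1 - kor), method = "complete")

r <- 0.6
groups <- cutree(hc, h = 1 - r)         # cluster label per variable

# Smallest absolute correlation inside each cluster
min_within <- sapply(split(names(groups), groups), function(v) {
  min(kor[v, v])
})
print(groups)
print(min_within)
```

With other linkage methods, such as average or single linkage, this guarantee does not hold in general.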
The data set is split into numeric variables and factors, and a separate clustering model is built for each variable type. For factors, distances are computed as 1 - Cramer's V, based on chisq.test. For a large number of factor variables this may take some time. The resulting trees can be plotted using plot.
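
A distance of this kind for factors could look roughly as follows. This is a sketch under stated assumptions: Cramer's V is computed from the chi-squared statistic without continuity correction (the package may differ in this detail), the helper `cramers_v` is hypothetical and not part of klaR, and the built-in `Titanic` table is used only as convenient example data.

```r
# Sketch: 1 - Cramer's V as a distance between factor variables.
# Assumptions: no continuity correction; `cramers_v` is a hypothetical
# helper, not a klaR function.
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
}

# Expand the built-in Titanic contingency table to one row per person
tit <- as.data.frame(Titanic)
fac <- tit[rep(seq_len(nrow(tit)), tit$Freq),
           c("Class", "Sex", "Age", "Survived")]

# Matrix of pairwise Cramer's V values (diagonal is 1 by definition)
p   <- ncol(fac)
crv <- matrix(1, p, p, dimnames = list(names(fac), names(fac)))
for (i in seq_len(p - 1)) {
  for (j in (i + 1):p) {
    crv[i, j] <- crv[j, i] <- cramers_v(fac[[i]], fac[[j]])
  }
}

# Cluster the factors on 1 - V
hc_fac <- hclust(as.dist(1 - crv), method = "complete")
print(round(crv, 3))
```

The double loop calls chisq.test once per pair of factors, which is why the run time grows quickly with the number of factor variables.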
A natural next step is to choose one variable from each cluster to obtain a subset of largely uncorrelated variables for further analysis. Automatic variable selection can be done using cvtree and xtractvars.
If an additional class vector cl is passed to the function, the minimum correlation over all classes is used for any two variables.
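
The class-specific variant can be sketched in base R as below. The sketch assumes that the minimum is taken over absolute correlations, which the text above does not state explicitly; the names `kor_by_class`, `kor_min` and `hc_cl` are illustrative and do not come from the package.

```r
# Sketch: class-specific correlations combined by their elementwise minimum.
# Assumption: the minimum is taken over absolute correlations.
num <- iris[, 1:4]
cl  <- iris$Species

# One absolute correlation matrix per class level
kor_by_class <- lapply(split(num, cl), function(xk) abs(cor(xk)))

# Elementwise minimum over all classes
kor_min <- Reduce(pmin, kor_by_class)

# Cluster the variables on 1 - minimum correlation
hc_cl <- hclust(as.dist(1 - kor_min), method = "complete")
print(round(kor_min, 3))
```

Under this reading, two variables count as strongly correlated only if they are strongly correlated within every class.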

Value

Object of class corclust.

`cor`	Correlation matrix of numeric variables.
`crv`	Matrix of Cramer's V for factor variables.
`cluster.numerics`	Resulting hierarchical `hclust` model for numeric variables.
`cluster.factors`	Resulting hierarchical `hclust` model for factor variables.
`id.numerics`	Variable IDs of numeric variables in `x`.
`id.factors`	Variable IDs of factor variables in `x`.

Author(s)

Gero Szepannek

References

Roever, C. and Szepannek, G. (2005): Application of a genetic algorithm to variable selection in fuzzy clustering. In C. Weihs and W. Gaul (eds), Classification - The Ubiquitous Challenge, 674-681, Springer.

Examples

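    library(klaR)
    data(iris)
    # Class labels used for class-specific correlations
    classes <- iris$Species
    # The four numeric measurements
    variables <- iris[,1:4]
    # Cluster the variables, using the minimum correlation over species
    ccres <- corclust(variables, classes)
    # Plot the tree with a reference line at correlation 0.6
    plot(ccres, mincor = 0.6)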

[Package klaR version 1.7-3 Index]