clustering {metan} | R Documentation |
Clustering analysis
Description
Performs clustering analysis with selection of variables.
Usage
clustering(
.data,
...,
by = NULL,
scale = FALSE,
selvar = FALSE,
verbose = TRUE,
distmethod = "euclidean",
clustmethod = "average",
nclust = NA
)
Arguments
.data |
The data to be analyzed. It can be a data frame, possibly with
grouped data passed from |
... |
The variables in |
by |
One variable (factor) to compute the function by. It is a shortcut
to |
scale |
Should the data be scaled before computing the distances? Defaults to FALSE. If TRUE, each observation is divided by the standard deviation of its variable: \(Z_{ij} = X_{ij} / sd_j\) |
selvar |
Logical argument, set to |
verbose |
Logical argument. If |
distmethod |
The distance measure to be used. This must be one of
|
clustmethod |
The agglomeration method to be used. This should be one of
|
nclust |
The number of clusters to be formed. Set to |
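The scaling described for the scale argument can be illustrated with a minimal base R sketch. The toy data frame X and the object names are hypothetical, and the code only reproduces the formula \(Z_{ij} = X_{ij} / sd_j\); it is not taken from the package source.

```r
# Hypothetical toy data: 5 observations, 2 variables on very different scales
X <- data.frame(height = c(150, 160, 170, 180, 190),
                yield  = c(2.1, 2.5, 3.0, 3.2, 3.9))

# Divide every column by its standard deviation (no centering),
# i.e. Z_ij = X_ij / sd_j
Z <- sweep(as.matrix(X), 2, apply(X, 2, sd), "/")

# Each scaled column now has unit standard deviation
apply(Z, 2, sd)

# Euclidean distances on raw and on scaled data; without scaling,
# the variable with the largest variance dominates the distances
d_raw    <- dist(X, method = "euclidean")
d_scaled <- dist(Z, method = "euclidean")
```

The same result should be obtainable with scale(X, center = FALSE, scale = apply(X, 2, sd)).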
Details
When selvar = TRUE, a variable selection algorithm is executed. The
objective is to select the group of variables that contributes most to explaining
the variability of the original data. The selection is based on an
eigenvalue/eigenvector solution and proceeds through the following steps.
1. Compute the distance matrix and the cophenetic correlation with the original variables (all numeric variables in the dataset);
2. Compute the eigenvalues and eigenvectors of the correlation matrix between the variables;
3. Delete the variable with the largest weight (the largest element of the eigenvector associated with the smallest eigenvalue);
4. Compute the distance matrix and the cophenetic correlation with the remaining variables;
5. Compute Mantel's correlation between the obtained distance matrix and the original distance matrix;
6. Iterate steps 2 to 5 p - 2 times, where p is the number of original variables.
At the end of the p - 2 iterations, a summary of the models is returned. The distance is calculated with the variables that generated the model with the largest cophenetic correlation. I suggest a careful evaluation aimed at choosing a parsimonious model, i.e., the one with the fewest variables that still presents an acceptable cophenetic correlation and high similarity with the original distances.
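The steps above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the toy data, the helper cophenetic_cor() and the summary_tab layout are hypothetical, "largest weight" is interpreted as the largest absolute element of the last eigenvector, and Mantel's correlation is computed as the plain Pearson correlation between the two distance vectors, without the permutation-based p-value.

```r
set.seed(1)
# Hypothetical toy data: 10 observations, 5 numeric variables
X <- as.data.frame(matrix(rnorm(50), nrow = 10,
                          dimnames = list(NULL, paste0("V", 1:5))))

# Cophenetic correlation of an average-linkage tree on Euclidean distances
cophenetic_cor <- function(data) {
  d  <- dist(data, method = "euclidean")
  hc <- hclust(d, method = "average")
  cor(as.vector(d), as.vector(cophenetic(hc)))
}

# Step 1: distances and cophenetic correlation with all variables
d_full <- dist(X)
vars   <- names(X)
p      <- length(vars)
summary_tab <- data.frame(n_vars = p, removed = NA_character_,
                          cophenetic = cophenetic_cor(X), mantel_r = 1)

for (i in seq_len(p - 2)) {
  # Step 2: eigen-decomposition of the correlation matrix
  eig <- eigen(cor(X[, vars, drop = FALSE]))
  # Step 3: drop the variable with the largest absolute element in the
  # eigenvector of the smallest eigenvalue (eigen() sorts decreasingly)
  last_vec <- eig$vectors[, length(vars)]
  drop_var <- vars[which.max(abs(last_vec))]
  vars     <- setdiff(vars, drop_var)
  # Steps 4 and 5: distances, cophenetic and Mantel-type correlation
  d_sub <- dist(X[, vars, drop = FALSE])
  summary_tab <- rbind(summary_tab, data.frame(
    n_vars = length(vars), removed = drop_var,
    cophenetic = cophenetic_cor(X[, vars, drop = FALSE]),
    mantel_r = cor(as.vector(d_full), as.vector(d_sub))))
}

summary_tab
```

Each row of summary_tab corresponds to one candidate model, so a parsimonious choice can be made by comparing the number of variables against the cophenetic and Mantel-type correlations.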
Value
- data: The data that was used to compute the distances.
- cutpoint: The cutpoint of the dendrogram according to Mojena (1977).
- distance: The matrix with the distances.
- de: The distances in an object of class dist.
- hc: The hierarchical clustering.
- Sqt: The total sum of squares.
- tab: A table with the clusters and similarity.
- clusters: The sum of squares and the mean of the clusters for each variable.
- cofgrap: If selvar = TRUE, cofgrap is a ggplot2-based graphic showing the cophenetic correlation for each model (with different numbers of variables). Otherwise, it is a NULL object.
- statistics: If selvar = TRUE, statistics shows the summary of the models fitted with different numbers of variables, including the cophenetic correlation, Mantel's correlation with the original distances (all variables), and the p-value associated with Mantel's test. Otherwise, it is a NULL object.
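To give an idea of how a cutpoint of the kind reported in cutpoint can be obtained, the sketch below applies one common form of Mojena's stopping rule: the mean of the fusion heights plus k times their standard deviation. The constant k = 1.25 is an assumption made here for illustration; this page does not state which value clustering() uses. The built-in USArrests data stands in for real input.

```r
# Average-linkage clustering on Euclidean distances (built-in dataset)
hc <- hclust(dist(USArrests), method = "average")

# Assumed constant for the stopping rule (not confirmed by this page)
k <- 1.25

# Cutpoint: mean of the fusion heights plus k standard deviations
cutpoint <- mean(hc$height) + k * sd(hc$height)

# Merges above the cutpoint are not performed, so the number of
# clusters is the number of such merges plus one
n_clusters <- sum(hc$height > cutpoint) + 1

# Cluster membership for each observation
groups <- cutree(hc, k = n_clusters)
table(groups)
```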
Author(s)
Tiago Olivoto tiagoolivoto@gmail.com
References
Mojena, R. 1977. Hierarchical grouping methods and stopping rules: an evaluation. Comput. J. 20:359-363. doi:10.1093/comjnl/20.4.359
Examples
library(metan)
# All rows and all numeric variables from data
d1 <- clustering(data_ge2)
# Based on the mean for each genotype
mean_gen <-
  data_ge2 %>%
  mean_by(GEN) %>%
  column_to_rownames("GEN")
d2 <- clustering(mean_gen)
# Select variables to compute the distances
d3 <- clustering(mean_gen, selvar = TRUE)
# Compute the distances with standardized data
# Define 4 clusters
d4 <- clustering(data_ge,
                 by = ENV,
                 scale = TRUE,
                 nclust = 4)