cluster_analysis {parameters}    R Documentation
Cluster Analysis
Description
Compute hierarchical or k-means cluster analysis and return the group assignment for each observation as a vector.
Usage
cluster_analysis(
x,
n = NULL,
method = "kmeans",
include_factors = FALSE,
standardize = TRUE,
verbose = TRUE,
distance_method = "euclidean",
hclust_method = "complete",
kmeans_method = "Hartigan-Wong",
dbscan_eps = 15,
iterations = 100,
...
)
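A minimal sketch of a typical call (illustrative only; see the Examples section below for more complete usage):
library(parameters)
rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans") # 3-cluster k-means on the numeric iris variables
predict(rez) # group assignment for each observation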
Arguments
x
A data frame (with at least two variables), or a matrix (with at least two columns).
n
Number of clusters used for supervised cluster methods. If NULL, the number of clusters to extract is determined by calling n_clusters(). This argument does not apply to unsupervised clustering methods such as dbscan or pamk.
method
Method for computing the cluster analysis. Can be "kmeans" (default), "hkmeans", "pam", "pamk", "hclust", "dbscan", "hdbscan", or "mixture".
include_factors
Logical, if TRUE, factors are converted to numerical values and included in the data used for clustering. By default, factors are removed, because most clustering methods require numeric input only.
standardize
Standardize the data frame before clustering (default: TRUE).
verbose
Toggle warnings and messages.
distance_method
Distance measure to be used for methods based on distances (e.g., when method = "hclust" for hierarchical clustering). For other methods, such as "kmeans", this argument is ignored.
hclust_method
Agglomeration method to be used when method = "hclust" (hierarchical clustering). Can be "complete" (default), "ward.D", "ward.D2", "single", "average", "mcquitty", "median", or "centroid". See hclust() for details. A brief sketch of overriding these settings follows this list.
kmeans_method
Algorithm used for calculating k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd", or "MacQueen". See kmeans() for details.
dbscan_eps
The eps argument for the DBSCAN method.
iterations
The number of replications.
...
Arguments passed to or from other methods.
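As noted for distance_method and hclust_method above, here is a brief sketch of overriding the distance measure and agglomeration method for hierarchical clustering (the specific values are illustrative assumptions, not recommendations):
rez <- cluster_analysis(
  iris[1:4],
  n = 3,
  method = "hclust",
  distance_method = "manhattan", # assumption: a distance measure accepted by dist()
  hclust_method = "ward.D2" # agglomeration method passed to hclust()
)
predict(rez) # group assignment for each observation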
Details
The print() and plot() methods show the (standardized) mean value for each variable within each cluster. Thus, a higher absolute value indicates that a certain variable characteristic is more pronounced within that specific cluster (as compared to other cluster groups with lower absolute mean values).
Cluster classifications can be obtained via predict(x, newdata = NULL, ...).
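For illustration, a minimal sketch of classifying observations, assuming the predict() method accepts a newdata data frame containing the same variables used for clustering:
rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans")
predict(rez) # classification of the original observations
predict(rez, newdata = iris[1:10, 1:4]) # assumption: new observations are assigned to the closest cluster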
Value
The group classification for each observation as a vector. The returned vector includes missing values, so it has the same length as nrow(x).
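A small illustration of this behavior, assuming the group assignments are retrieved via predict() as in the Examples below:
rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans")
clusters <- predict(rez)
length(clusters) == nrow(iris) # TRUE: one (possibly missing) assignment per row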
Note
There is also a plot()-method implemented in the see-package.
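A minimal sketch of that plot() method, assuming the see package is installed:
if (requireNamespace("see", quietly = TRUE)) {
  rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans")
  plot(rez) # standardized mean value per variable within each cluster
}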
References
Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.
See Also
- n_clusters() to determine the number of clusters to extract.
- cluster_discrimination() to determine the accuracy of cluster group classification via linear discriminant analysis (LDA).
- performance::check_clusterstructure() to check suitability of data for clustering.
- https://www.datanovia.com/en/lessons/
Examples
set.seed(33)
# K-Means ====================================================
rez <- cluster_analysis(iris[1:4], n = 3, method = "kmeans")
rez # Show results
predict(rez) # Get clusters
summary(rez) # Extract the cluster centers (you can use 'plot()' on that)
if (requireNamespace("MASS", quietly = TRUE)) {
cluster_discrimination(rez) # Perform LDA
}
# Hierarchical k-means (more robust k-means)
if (require("factoextra", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], n = 3, method = "hkmeans")
rez # Show results
predict(rez) # Get clusters
}
# Hierarchical Clustering (hclust) ===========================
rez <- cluster_analysis(iris[1:4], n = 3, method = "hclust")
rez # Show results
predict(rez) # Get clusters
# K-Medoids (pam) ============================================
if (require("cluster", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], n = 3, method = "pam")
rez # Show results
predict(rez) # Get clusters
}
# PAM with automated number of clusters
if (require("fpc", quietly = TRUE)) {
rez <- cluster_analysis(iris[1:4], method = "pamk")
rez # Show results
predict(rez) # Get clusters
}
# DBSCAN ====================================================
if (require("dbscan", quietly = TRUE)) {
# Note that you can assimilate more outliers (cluster 0) to neighbouring
# clusters by setting borderPoints = TRUE.
rez <- cluster_analysis(iris[1:4], method = "dbscan", dbscan_eps = 1.45)
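# Illustrative variant (assumption: 'borderPoints' is forwarded to dbscan::dbscan() via '...'):
# rez_border <- cluster_analysis(iris[1:4], method = "dbscan", dbscan_eps = 1.45, borderPoints = TRUE)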
rez # Show results
predict(rez) # Get clusters
}
# Mixture ====================================================
if (require("mclust", quietly = TRUE)) {
library(mclust) # Needs the package to be loaded
rez <- cluster_analysis(iris[1:4], method = "mixture")
rez # Show results
predict(rez) # Get clusters
}