R: Hierarchical Clustering for 1D

hclust1d {hclust1d}

R Documentation

Hierarchical Clustering for 1D

Description

Univariate hierarchical agglomerative clustering routine with a few possible choices of a linkage function.

Usage

hclust1d(x, distance = FALSE, squared = FALSE, method = "complete")

Arguments

`x`	a vector of 1D points to be clustered, or a distance structure as produced by `dist`.
`distance`	a logical value indicating, whether `x` is a vector of 1D points to be clustered (`distance = FALSE`, the default), or a distance structure (`distance = TRUE`).
`squared`	a logical value indicating, whether `distance` is squared (`squared = TRUE`) or not (`squared = FALSE`, the default). Its value is irrelevant for `distance = FALSE` setting.
`method`	linkage method, with `"complete"` as a default. See `supported_methods` for the complete list.

Details

If x is a distance matrix, the first step of the algorithm is computing a conforming vector of 1D points (with arbitrary shift and sign choices).

Univariate hierarchical clustering is performed for the provided or calculated vector of points: initially, each point is assigned its own singleton cluster, and then the clusters get merged with their nearest neighbors, two at a time.

For method = "single", there is no need to recompute distances, since the original inter-point distances are also the inter-cluster distances, so the algorithm requires only sorting the original points and then sorting the distances.

For other linkage methods, two distances (between the merged cluster and the preceding and the following clusters) get recomputed at each merge, and the resulting distance structure gets updated in an efficiently implemented heap providing a priority queue functionality (the access to the current minimum distance) in O(log n) time at each step. The resulting algorithm has O(n*log n) time complexity.

Value

A list object with S3 class "hclust", compatible with a regular stats::hclust output:

`merge`	a matrix with n-1 rows and 2 columns. Each i-th row of the matrix details merging performed at the i-th step of the algorithm. If the singleton cluster was merged at this step, the value of the element is negative and its absolute value equals the index of this point. Otherwise, a positive value, say j, of an element in i-th row, indicates that at the stage i a cluster created at a previous stage j was merged.
`height`	a vector with n-1 values, with the i-th value indicating the distance between the two clusters merged at the i-th step of the algorithm.
`order`	a permutation of the input points sorting them in an increasing order. Since the sign of points computed from the distance structure can be arbitrarily chosen, in the case of a distance structure input, the order can be increasing or decreasing.
`labels`	either point names, or point values, or point indices, in the order of availability.
`call`	the call which produced the results.
`method`	the linkage method used for clustering.
`dist.method`	the distance method used in building the distance matrix; or `"euclidean"`, if `x` is a vector of 1D points

Note

Please note that in stats::hclust, the inter-cluster distances for ward.D, centroid and median linkages (returned as height) are squared euclidean distances between the relevant clusters' centroids, although that behavior is not well documented. This behavior is also in odds with other linkage methods, for which unsquared euclidean distances are returned. The implementation in hclust1d::hclust1d follows that behavior in full.

Also, stats::hclust expects squared euclidean distance structure as input for method="ward.D", method="centroid" and method="median", although the latter is not well documented, either. Squared distance is not a proper distance (a triangle inequality may not be maintained), so it should be considered dissimilarity instead.

To retain compatibility, hlust1d::hclust1d accepts x in a form of a squared euclidean distance structure between points as well (indicated by both distance and squared arguments set to TRUE). Also, note that hlust1d::hclust1d returns the same heights for unsquared proper distances in x (with distance=TRUE setting and the default squared=FALSE argument) and for x in a form of a vector of 1D points (with the default distance=FALSE argument). Please consult the Examples section below for further reference on that behavior.

Examples


# Faster replacements for
# stats::hclust(dist(rnorm(100))) with a default complete linkage
dendrogram <- hclust1d(rnorm(100))
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE)

# Faster replacements for
# stats::hclust(dist(rnorm(100)), method = "average")
dendrogram <- hclust1d(rnorm(100), method = "average")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "average")

# Faster replacements for
# stats::hclust(dist(rnorm(100))^2, method = "centroid")
# Note that stats::hclust expects squared euclidean distance input for centroid linkage
# While in case of hclust1d, 3 below calls result in the equivalent output
dendrogram <- hclust1d(rnorm(100), method = "centroid")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "centroid")
dendrogram <- hclust1d(dist(rnorm(100))^2, distance = TRUE, squared = TRUE, method = "centroid")

# Faster replacements for
# stats::hclust(dist(rnorm(100))^2, method = "median")
# Note that stats::hclust expects squared euclidean distance input for median linkage
# While in case of hclust1d, 3 below calls result in the equivalent output
dendrogram <- hclust1d(rnorm(100), method = "median")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "median")
dendrogram <- hclust1d(dist(rnorm(100))^2, distance = TRUE, squared = TRUE, method = "median")

# Faster replacements for
# stats::hclust(dist(rnorm(100)), method = "mcquitty")
dendrogram <- hclust1d(rnorm(100), method = "mcquitty")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "mcquitty")

# Faster replacements for
# stats::hclust(dist(rnorm(100))^2, method = "ward.D")
# Note that stats::hclust expects squared euclidean distance input for ward.D linkage
# While in case of hclust1d, 3 below calls result in the equivalent output
dendrogram <- hclust1d(rnorm(100), method = "ward.D")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "ward.D")
dendrogram <- hclust1d(dist(rnorm(100))^2, distance = TRUE, squared = TRUE, method = "ward.D")

# Faster replacements for
# stats::hclust(dist(rnorm(100)), method = "ward.D2")
dendrogram <- hclust1d(rnorm(100), method = "ward.D2")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "ward.D2")

# Faster replacements for
# stats::hclust(dist(rnorm(100)), method = "single")
dendrogram <- hclust1d(rnorm(100), method = "single")
dendrogram <- hclust1d(dist(rnorm(100)), distance = TRUE, method = "single")

# A 1D-specific true median linkage
dendrogram <- hclust1d(rnorm(100), method = "true_median")

# Plotting the resulting dendrogram
plot(dendrogram)

[Package hclust1d version 0.1.1 Index]