adpclust {ADPclust}  R Documentation 
Fast Clustering Using Adaptive Density Peak Detection
Description
Clustering of data by finding cluster centers from estimated density peaks. ADPclust is a non-iterative procedure that incorporates multivariate Gaussian density estimation. The number of clusters and the bandwidth can each be either set by the user or selected automatically through an internal clustering criterion.
Usage
adpclust(x = NULL, distm = NULL, p = NULL, centroids = "auto",
h = NULL, htype = "amise", nclust = 2:10, ac = 1, f.cut = c(0.1,
0.2, 0.3), fdelta = "mnorm", dmethod = "euclidean", draw = FALSE)
Arguments
x 
numeric data frame where rows are observations and columns are variables. One of x and distm must be provided. 
distm 
distance matrix of class 'dist'. distm is ignored if x is given. 
p 
number of variables (ncol(x)). This is only needed if neither x nor h is given. 
centroids 
character string specifying how cluster centroids are selected. Valid options are "user" and "auto". 
h 
nonnegative number specifying the bandwidth in density estimation. If h is NULL, the algorithm attempts to find h in a neighborhood centered at either the AMISE bandwidth or ROT bandwidth (see htype). 
htype 
character string specifying the method used to calculate a reference bandwidth for the density estimation. htype is ignored if h is given. Valid options are "ROT" and "AMISE" (see Details). 
nclust 
integer, or a vector of integers specifying the pool of the number of clusters in automatic variation. The default is 2:10. 
ac 
integer indicating which automatic cut method is used. This is ignored if centroids = 'user'. The valid options are:

f.cut 
number in (0, 1), or a numeric vector of numbers in (0, 1). f.cut is used when centroids = "auto" and ac = 1 to automatically select cluster centroids from the decision plot (see ac). The default is c(0.1, 0.2, 0.3). 
fdelta 
character string that specifies the method used to estimate local density f(x) at each data point x. The default (recommended) is "mnorm" that uses a multivariate Gaussian density estimation to calculate f. Other options are listed below. Here 'distm' denotes the distance matrix.

dmethod 
character string that is passed to the 'method' argument in function dist(), which is used to calculate the distance matrix if 'distm' is not given. The default is "euclidean". 
draw 
boolean. If draw = TRUE, the clustering result is plotted after the algorithm finishes. The plot is produced by plot.adpclust(ans), where 'ans' is the outcome of 'adpclust()'. 
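The relationship between 'x', 'distm' and 'dmethod' can be illustrated with base R alone. The sketch below builds an object of class 'dist' with dist(), which is the function 'dmethod' is passed to. The toy data frame is made up for illustration and is not part of the package.

```r
# Toy numeric data frame: rows are observations, columns are variables.
# (Made-up values, for illustration only.)
x <- data.frame(a = c(0, 3, 0), b = c(0, 4, 1))

# dist() computes pairwise distances; 'method' is what dmethod is passed to.
distm <- dist(x, method = "euclidean")

class(distm)          # "dist"
as.matrix(distm)      # full symmetric matrix view of the same distances
```

An object built this way has the class that the 'distm' argument expects.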
Details
Given n data points x in p dimensions, adpclust() calculates f(x) and delta(x) for each data point x, where f(x) is the local density at x, and delta(x) is the shortest distance between x and y over all y such that f(x) <= f(y). Data points with both large f and large delta values are labeled cluster centroids. In other words, they appear as isolated points in the upper right corner of the f vs. delta plot (the decision plot). After the cluster centroids are determined, the remaining data points are assigned to clusters according to their distances to the closest centroids.
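The quantities f and delta described above can be sketched in base R. This is a minimal illustration, not the package's implementation: an unnormalised Gaussian kernel sum stands in for the "mnorm" estimator, the bandwidth is set by hand, and centroids are picked by the largest product f * delta, which is an assumed stand-in for the package's 'ac' rules. The simulated data are also made up.

```r
set.seed(1)
# Two well-separated toy clusters of 30 points each in 2 dimensions
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2))
n <- nrow(x)
h <- 0.5                          # bandwidth, chosen by hand for this sketch
d <- as.matrix(dist(x))           # pairwise Euclidean distances

# Local density f(x): unnormalised Gaussian kernel sum (assumed estimator)
f <- rowSums(exp(-d^2 / (2 * h^2)))

# delta(x): shortest distance to another point whose density is at least f(x);
# for the global density peak, fall back to its largest distance
delta <- vapply(seq_len(n), function(i) {
  higher <- which(f >= f[i] & seq_len(n) != i)
  if (length(higher) == 0) max(d[i, ]) else min(d[i, higher])
}, numeric(1))

# Centroids: points with large f and large delta (here: largest f * delta)
nclust <- 2
centers <- order(f * delta, decreasing = TRUE)[seq_len(nclust)]

# Assign every point to its closest centroid
clusters <- apply(d[, centers, drop = FALSE], 1, which.min)
```

Plotting f against delta, for example with plot(f, delta), gives a rough version of the decision plot in which the two chosen centroids stand apart from the rest.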
A bandwidth (smoothing parameter) h is used to calculate the local density f(x) in various ways. See parameter 'fdelta' for details. If centroids = 'user', then h must be explicitly provided. If centroids = 'auto' and h is not specified, then it is selected automatically from a range of test values: first, a reference bandwidth h0 is calculated by one of two methods, Scott's Rule-of-Thumb value (htype = "ROT") or Wand's Asymptotic-Mean-Integrated-Squared-Error value (htype = "AMISE"); then 10 values equally spread over the range [h0/3, 3*h0] are tested. The value that yields the highest silhouette score is chosen as the final h.
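The shape of this search can be sketched as follows. The sketch assumes that "equally spread" means linearly spaced; 'h0' is a made-up reference value, and 'score_fun' is a hypothetical placeholder for "cluster with bandwidth h and return the silhouette score", not a function from the package.

```r
h0 <- 0.6                                        # made-up reference bandwidth

# 10 candidate bandwidths, linearly spaced over [h0/3, 3*h0] (assumed spacing)
h_grid <- seq(h0 / 3, 3 * h0, length.out = 10)

# Hypothetical stand-in for the silhouette score obtained with bandwidth h;
# this toy function simply peaks near h = 0.5
score_fun <- function(h) -(log(h) - log(0.5))^2

# Evaluate every candidate and keep the one with the highest score
scores <- vapply(h_grid, score_fun, numeric(1))
h_best <- h_grid[which.max(scores)]
```

In a real run the score for each candidate would come from clustering the data with that bandwidth, so the cost of the search grows with the number of candidates tested.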
Value
An 'adpclust' object that contains the list of the following items.
clusters: Cluster assignments. A vector of the same length as the number of observations.
centers: Indices of the clustering centers.
silhouette: Silhouette score from the final clustering result.
nclust: Number of clusters.
h: Final bandwidth.
f: Final density vector f(x).
delta: Final delta vector delta(x).
selection.type: 'user' or 'auto'.
References
Xiao-Feng Wang and Yifan Xu (2015). "Fast Clustering Using Adaptive Density Peak Detection." Statistical Methods in Medical Research, doi:10.1177/0962280215609948.
Examples
# Load a data set with 3 clusters
data(clust3)
# Automatically select cluster centroids
ans <- adpclust(clust3, centroids = "auto", draw = FALSE)
summary(ans)
plot(ans)
# Specify distm instead of data
distm <- FindDistm(clust3, normalize = TRUE)
ans.distm <- adpclust(distm = distm, p = 2, centroids = "auto", draw = FALSE)
identical(ans, ans.distm)
# Specify the grid of h and nclust
ans <- adpclust(clust3, centroids = "auto", h = c(0.1, 0.2, 0.3), nclust = 2:6)
# Specify that bandwidths should be searched around
# Wand's Asymptotic-Mean-Integrated-Squared-Error bandwidth
# Also test 3 to 6 clusters.
ans <- adpclust(clust3, centroids = "auto", htype = "AMISE", nclust = 3:6)
# Set a specific bandwidth value.
ans <- adpclust(clust3, centroids = "auto", h = 5)
# Change method of automatic selection of centers
ans <- adpclust(clust3, centroids = "auto", nclust = 2:6, ac = 2)
# Specify the single "ROT" bandwidth value by
# using the 'ROT()' function
ans <- adpclust(clust3, centroids = "auto", h = ROT(clust3))
# Centroids selected by user
## Not run:
ans <- adpclust(clust3, centroids = "user", h = ROT(clust3))
## End(Not run)
# A larger data set
data(clust5)
ans <- adpclust(clust5, centroids = "auto", htype = "ROT", nclust = 3:5)
summary(ans)
plot(ans)