dbscan {dbscan}  R Documentation 
Fast reimplementation of the DBSCAN (Densitybased spatial clustering of applications with noise) clustering algorithm using a kdtree.
dbscan(x, eps, minPts = 5, weights = NULL, borderPoints = TRUE, ...)
is.corepoint(x, eps, minPts = 5, ...)
## S3 method for class 'dbscan_fast'
predict(object, newdata, data, ...)
x 
a data matrix, a data.frame, a dist object or a frNN object with fixedradius nearest neighbors. 
eps 
size (radius) of the epsilon neighborhood. Can be omitted if

minPts 
number of minimum points required in the eps neighborhood for core points (including the point itself). 
weights 
numeric; weights for the data points. Only needed to perform weighted clustering. 
borderPoints 
logical; should border points be assigned to clusters.
The default is 
... 
additional arguments are passed on to the fixedradius nearest
neighbor search algorithm. See 
object 
clustering object. 
newdata 
new data points for which the cluster membership should be predicted. 
data 
the data set used to create the clustering object. 
The
implementation is significantly faster and can work with larger data sets
than fpc::dbscan()
in fpc. Use dbscan::dbscan()
(with specifying the package) to
call this implementation when you also load package fpc.
The algorithm
This implementation of DBSCAN follows the original algorithm as described by Ester et al (1996). DBSCAN performs the following steps:
Estimate the density around each data point by counting the number of points in a userspecified epsneighborhood and applies a usedspecified minPts thresholds to identify core, border and noise points.
Core points are joined into a cluster if they are densityreachable (i.e., there is a chain of core points where one falls inside the epsneighborhood of the next).
Border points are assigned to clusters. The algorithm needs parameters
eps
(the radius of the epsilon neighborhood) and minPts
(the
density threshold).
Border points are arbitrarily assigned to clusters in the original
algorithm. DBSCAN* (see Campello et al 2013) treats all border points as
noise points. This is implemented with borderPoints = FALSE
.
Specifying the data
If x
is a matrix or a data.frame, then fast fixedradius nearest
neighbor computation using a kdtree is performed using Euclidean distance.
See frNN()
for more information on the parameters related to
nearest neighbor search. Note that only numerical values are allowed in x
.
Any precomputed distance matrix (dist object) can be specified as x
.
You may run into memory issues since distance matrices are large.
A precomputed frNN object can be supplied as x
. In this case
eps
does not need to be specified. This option us useful for large
data sets, where a sparse distance matrix is available. See
frNN()
how to create frNN objects.
Setting parameters for DBSCAN
The parameters minPts
and eps
depend on each other and
changing one typically requires changing the other one as well. The original
DBSCAN paper suggests to start by setting minPts
to the
dimensionality of the data plus one or higher. minPts
defines the
minimum density around a core point (i.e., the minimum density for nonnoise
areas). Increase the parameter to suppress more noise in the data and
require more points to form a cluster. A suitable neighborhood size
parameter eps
given a fixed value for minPts
can be found
visually by inspecting the kNNdistplot()
of the data using
k = minPts  1
(minPts
includes the point itself, while the
knearest neighbors distance does not). The knearest neighbor distance plot
sorts all data points by their knearest neighbor distance. A sudden
increase of the kNN distance (a knee) indicates that the points to the right
are most likely outliers. Choose eps
for DBSCAN where the knee is.
Predict cluster memberships
predict()
can be used to predict cluster memberships for new data
points. A point is considered a member of a cluster if it is within the eps
neighborhood of a core point of the cluster. Points
which cannot be assigned to a cluster will be reported as
noise points (i.e., cluster ID 0).
Important note: predict()
currently can only use Euclidean distance to determine
the neighborhood of core points. If dbscan()
was called using distances other than Euclidean,
then the neighborhood calculation will not be correct and only approximated by Euclidean
distances. If the data contain factor columns (e.g., using Gower's distance), then
the factors in data
and query
first need to be converted to numeric to use the
Euclidean approximation.
dbscan()
returns an object of class dbscan_fast
with the following components:
eps 
value of the 
minPts 
value of the 
cluster 
A integer vector with cluster assignments. Zero indicates noise points. 
is.corepoint()
returns a logical vector indicating for each data point if it is a
core point.
Michael Hahsler
Hahsler M, Piekenbrock M, Doran D (2019). dbscan: Fast DensityBased Clustering with R. Journal of Statistical Software, 91(1), 130. doi:10.18637/jss.v091.i01
Martin Ester, HansPeter Kriegel, Joerg Sander, Xiaowei Xu (1996). A DensityBased Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD96), 226231. https://dl.acm.org/doi/10.5555/3001460.3001507
Campello, R. J. G. B.; Moulavi, D.; Sander, J. (2013). DensityBased Clustering Based on Hierarchical Density Estimates. Proceedings of the 17th PacificAsia Conference on Knowledge Discovery in Databases, PAKDD 2013, Lecture Notes in Computer Science 7819, p. 160. doi:10.1007/9783642374562_14
Other clustering functions:
extractFOSC()
,
hdbscan()
,
jpclust()
,
optics()
,
sNNclust()
## Example 1: use dbscan on the iris data set
data(iris)
iris < as.matrix(iris[, 1:4])
## Find suitable DBSCAN parameters:
## 1. We use minPts = dim + 1 = 5 for iris. A larger value can also be used.
## 2. We inspect the kNN distance plot for k = minPts  1 = 4
kNNdistplot(iris, minPts = 5)
## Noise seems to start around a 4NN distance of .7
abline(h=.7, col = "red", lty = 2)
## Cluster with the chosen parameters
res < dbscan(iris, eps = .7, minPts = 5)
res
pairs(iris, col = res$cluster + 1L)
## Use a precomputed frNN object
fr < frNN(iris, eps = .7)
dbscan(fr, minPts = 5)
## Example 2: use data from fpc
set.seed(665544)
n < 100
x < cbind(
x = runif(10, 0, 10) + rnorm(n, sd = 0.2),
y = runif(10, 0, 10) + rnorm(n, sd = 0.2)
)
res < dbscan(x, eps = .3, minPts = 3)
res
## plot clusters and add noise (cluster 0) as crosses.
plot(x, col = res$cluster)
points(x[res$cluster == 0, ], pch = 3, col = "grey")
hullplot(x, res)
## Predict cluster membership for new data points
## (Note: 0 means it is predicted as noise)
newdata < x[1:5,] + rnorm(10, 0, .3)
hullplot(x, res)
points(newdata, pch = 3 , col = "red", lwd = 3)
text(newdata, pos = 1)
pred_label < predict(res, newdata, data = x)
pred_label
points(newdata, col = pred_label + 1L, cex = 2, lwd = 2)
## Compare speed against fpc version (if microbenchmark is installed)
## Note: we use dbscan::dbscan to make sure that we do now run the
## implementation in fpc.
## Not run:
if (requireNamespace("fpc", quietly = TRUE) &&
requireNamespace("microbenchmark", quietly = TRUE)) {
t_dbscan < microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3), times = 10, unit = "ms")
t_dbscan_linear < microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "linear"), times = 10, unit = "ms")
t_dbscan_dist < microbenchmark::microbenchmark(
dbscan::dbscan(x, .3, 3, search = "dist"), times = 10, unit = "ms")
t_fpc < microbenchmark::microbenchmark(
fpc::dbscan(x, .3, 3), times = 10, unit = "ms")
r < rbind(t_fpc, t_dbscan_dist, t_dbscan_linear, t_dbscan)
r
boxplot(r,
names = c('fpc', 'dbscan (dist)', 'dbscan (linear)', 'dbscan (kdtree)'),
main = "Runtime comparison in ms")
## speedup of the kdtreebased version compared to the fpc implementation
median(t_fpc$time) / median(t_dbscan$time)
}
## End(Not run)
## Example 3: manually create a frNN object for dbscan (dbscan only needs ids and eps)
nn < structure(list(ids = list(c(2,3), c(1,3), c(1,2,3), c(3,5), c(4,5)), eps = 1),
class = c("NN", "frNN"))
nn
dbscan(nn, minPts = 2)