dbscan {fpc}

R Documentation

DBSCAN density reachability and connectivity clustering

Description

Generates a density based clustering of arbitrary shape as introduced in Ester et al. (1996).

Usage

  dbscan(data, eps, MinPts = 5, scale = FALSE,
         method = c("hybrid", "raw", "dist"), seeds = TRUE,
         showplot = FALSE, countmode = NULL)

  ## S3 method for class 'dbscan'
  print(x, ...)

  ## S3 method for class 'dbscan'
  plot(x, data, ...)

  ## S3 method for class 'dbscan'
  predict(object, data, newdata = NULL, predict.max = 1000, ...)

Arguments

`data`	Data matrix, data.frame, dissimilarity matrix or `dist` object. Specify `method = "dist"` if `data` should be interpreted as a dissimilarity matrix or `dist` object; otherwise Euclidean distances are used.
`eps`	Reachability distance; see Ester et al. (1996).
`MinPts`	Minimum number of points required for reachability; see Ester et al. (1996).
`scale`	Scale the data if `TRUE`.
`method`	`"dist"` treats `data` as a distance matrix (relatively fast but memory-expensive); `"raw"` treats `data` as raw data and avoids calculating a distance matrix (saves memory but may be slow); `"hybrid"` also expects raw data but calculates partial distance matrices (very fast with moderate memory requirements).
`seeds`	Set to `FALSE` to omit the `isseed` vector from the `dbscan` object.
`showplot`	0 (or `FALSE`) = no plot, 1 = plot per iteration, 2 = plot per subiteration.
`countmode`	`NULL` or a vector of point numbers at which to report progress.
`x`	Object of class `dbscan`.
`object`	Object of class `dbscan`.
`newdata`	Matrix or data.frame with raw data to predict.
`predict.max`	Maximum batch size for predictions.
`...`	Further arguments passed to plot methods.

Details

A cluster requires a minimum number of points (MinPts) within a maximum distance (eps) around one of its members (the seed). Any point within eps of a point that satisfies the seed condition is also a cluster member (this rule is applied recursively). Some points may not belong to any cluster (noise).
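
The rule above can be sketched in a few lines of base R. This is an illustration only, not the fpc implementation: `dbscan_sketch` and `toy` are hypothetical names, and the sketch assumes that a point counts itself among its neighbours and that "within eps" means a Euclidean distance of at most eps.

```r
# Minimal sketch of the seed/border/noise rule (hypothetical helper, not fpc code).
dbscan_sketch <- function(x, eps, MinPts = 5) {
  x <- as.matrix(x)
  n <- nrow(x)
  d <- as.matrix(dist(x))                       # full distance matrix, O(n^2) memory
  # Assumption: each point is its own neighbour (d[i, i] = 0 <= eps).
  neighbours <- lapply(seq_len(n), function(i) which(d[i, ] <= eps))
  isseed <- lengths(neighbours) >= MinPts       # seed condition
  cluster <- integer(n)                         # 0 = noise or not yet assigned
  current <- 0L
  for (i in seq_len(n)) {
    if (cluster[i] != 0L || !isseed[i]) next    # only unassigned seeds start a cluster
    current <- current + 1L
    cluster[i] <- current
    queue <- neighbours[[i]]
    while (length(queue) > 0) {
      j <- queue[1]
      queue <- queue[-1]
      if (cluster[j] != 0L) next                # already assigned
      cluster[j] <- current
      # Only seeds extend the cluster; border points join but do not spread it.
      if (isseed[j]) queue <- c(queue, neighbours[[j]])
    }
  }
  list(cluster = cluster, isseed = isseed, eps = eps, MinPts = MinPts)
}

# Toy data: a tight group, one border point next to it, a second tight group, one outlier.
toy <- rbind(
  cbind(c(0, 0.1, 0.2, 0.1, 0), c(0, 0.1, 0, -0.1, 0.1)),
  c(0.45, 0),
  cbind(c(5, 5.1, 5.2, 5.1, 5), c(5, 5.1, 5, 4.9, 5.1)),
  c(10, 0)
)
res <- dbscan_sketch(toy, eps = 0.3, MinPts = 3)
res$cluster   # 1 1 1 1 1 1 2 2 2 2 2 0
res$isseed    # the border point (row 6) and the outlier (row 12) are not seeds
```

The point at (0.45, 0) lies within eps of only one other point, so it fails the seed condition but still joins cluster 1 as a border point; the outlier is left as noise (0).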

We have clustered a 100,000 x 2 dataset in 40 minutes on a 1600 MHz Pentium M.

print.dbscan reports how many of the points belonging to the clusters are seeds and how many are border points.

plot.dbscan distinguishes between seed and border points by plot symbol.

Value

predict.dbscan returns a vector of predicted clusters for the points in newdata.
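
This page does not spell out the rule used to assign new points. One plausible rule, shown here purely as an assumption, is to give each new point the cluster of its nearest seed if that seed lies within eps, and 0 (noise) otherwise. `predict_sketch`, `obj`, `old` and `new` are hypothetical names; this is not the code of predict.dbscan.

```r
# Sketch of a nearest-seed prediction rule (assumed, not taken from fpc).
predict_sketch <- function(object, data, newdata) {
  newdata <- as.matrix(newdata)
  seeds <- as.matrix(data)[object$isseed, , drop = FALSE]
  seed_cluster <- object$cluster[object$isseed]
  if (nrow(seeds) == 0) return(integer(nrow(newdata)))   # no seeds: everything is noise
  vapply(seq_len(nrow(newdata)), function(i) {
    # Euclidean distance from new point i to every seed
    d <- sqrt(colSums((t(seeds) - newdata[i, ])^2))
    k <- which.min(d)
    if (d[k] <= object$eps) seed_cluster[k] else 0L
  }, integer(1))
}

# Hand-made stand-in for a fitted object and its training data (hypothetical values).
old <- rbind(c(0, 0), c(0.1, 0), c(5, 5), c(5.1, 5), c(9, 9))
obj <- list(cluster = c(1L, 1L, 2L, 2L, 0L),
            isseed  = c(TRUE, TRUE, TRUE, TRUE, FALSE),
            eps = 0.3, MinPts = 2)
new <- rbind(c(0.2, 0), c(5, 5.2), c(9, 9))
pred <- predict_sketch(obj, old, new)
pred   # 1 2 0
```

The third new point coincides with a training point that was noise, so no seed lies within eps of it and it is again coded as 0.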

dbscan returns an object of class 'dbscan', which is a list with components

`cluster`	integer vector coding cluster membership with noise observations (singletons) coded as 0
`isseed`	logical vector indicating whether a point is a seed (not border, not noise)
`eps`	parameter eps
`MinPts`	parameter MinPts
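
As a rough illustration of how these components fit together, the following sketch cross-tabulates point roles by cluster. `ds_like` is a hand-made list with made-up values that only mimics the structure described above; it is not output from dbscan.

```r
# Hypothetical stand-in with the same components as a 'dbscan' object.
ds_like <- list(
  cluster = c(1L, 1L, 1L, 2L, 2L, 0L, 1L, 2L),
  isseed  = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  eps = 0.2, MinPts = 2
)

# A point is a seed, a noise point (cluster 0), or otherwise a border point.
role <- ifelse(ds_like$isseed, "seed",
               ifelse(ds_like$cluster == 0L, "noise", "border"))
tab <- table(role = factor(role, levels = c("seed", "border", "noise")),
             cluster = ds_like$cluster)
tab
```

Here cluster 1 has three seeds and one border point, cluster 2 has two seeds and one border point, and one observation is noise.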

Note

This is a simplified version of the original algorithm (no k-d trees are used), so the run time is O(n^2) instead of O(n*log(n)).
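
To give a feel for the quadratic cost, the arithmetic below counts the pairwise distances for a 100,000-point dataset and the memory a full `dist` object would need, assuming 8-byte doubles and ignoring object overhead. The variable names are illustrative.

```r
n <- 1e5
pairs <- n * (n - 1) / 2      # number of distinct pairwise distances
gib <- pairs * 8 / 2^30       # size in GiB if each distance is an 8-byte double
pairs                         # 4999950000
round(gib, 2)                 # 37.25
```

This is why `method = "dist"` is described as memory-expensive for large inputs.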

Author(s)

Jens Oehlschlaegel, based on a draft by Christian Hennig.

References

Martin Ester, Hans-Peter Kriegel, Joerg Sander, Xiaowei Xu (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Institute for Computer Science, University of Munich. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96).

Examples

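  set.seed(665544)
  n <- 600
  # 600 points scattered (sd = 0.2) around 10 random centres in [0, 10]^2;
  # the 10 centre coordinates are recycled across the n noise terms
  x <- cbind(runif(10, 0, 10) + rnorm(n, sd = 0.2),
             runif(10, 0, 10) + rnorm(n, sd = 0.2))
  par(bg = "grey40")
  ds <- dbscan(x, 0.2)
  # run with showplot = 1 to see how dbscan works
  ds
  plot(ds, x)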

  # four new points to assign to the clusters found above
  x2 <- rbind(c(5, 2), c(8, 3), c(4, 4), c(9, 9))
  predict(ds, x, x2)

  # 600 points around the three centres (1,1), (2,2) and (3,3)
  n <- 600
  x <- cbind((1:3) + rnorm(n, sd = 0.2), (1:3) + rnorm(n, sd = 0.2))

  # Not run; elapsed times in seconds on my machine were 0.105, 0.068 and 0.255:
  #   system.time(ds <- dbscan(x, 0.3, countmode = NULL, method = "raw"))[3]
  #   system.time(dsb <- dbscan(x, 0.3, countmode = NULL, method = "hybrid"))[3]
  #   system.time(dsc <- dbscan(dist(x), 0.3, countmode = NULL,
  #     method = "dist"))[3]

[Package fpc version 2.2-12 Index]