R: Interface functions for clustering methods

kmeansCBI {fpc}

R Documentation

Interface functions for clustering methods

Description

These functions provide an interface to several clustering methods implemented in R, for use together with the cluster stability assessment in clusterboot (as parameter clustermethod; "CBI" stands for "clusterboot interface"). They can also be useful for computing a clustering without running clusterboot, because some of them add features to the underlying methods, e.g., normal mixture model based clustering of dissimilarity matrices projected into Euclidean space by MDS, partitioning around medoids with an estimated number of clusters, and noise/outlier identification in hierarchical clustering.
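
As a minimal sketch of the intended use, the call below passes one of the interface functions to clusterboot as clustermethod. The iris data, the number of bootstrap replicates B and the seed are illustrative assumptions, not recommendations from this page.

```r
library(fpc)

# Four numeric measurements from the built-in iris data (illustrative choice).
x <- as.matrix(iris[, 1:4])

# The interface function itself is passed as clustermethod; further
# arguments (here krange) are handed on to kmeansCBI.
cb <- clusterboot(x, B = 20, bootmethod = "boot",
                  clustermethod = kmeansCBI, krange = 3,
                  seed = 1, count = FALSE)

# Clusterwise mean Jaccard similarities over the bootstrap replicates.
print(cb$bootmean)
```

Any other function from this page can be substituted for kmeansCBI, as long as its own arguments are supplied in the same call.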

Usage

kmeansCBI(data,krange,k,scaling=FALSE,runs=1,criterion="ch",...)

hclustCBI(data,k,cut="number",method,scaling=TRUE,noisecut=0,...)

hclusttreeCBI(data,minlevel=2,method,scaling=TRUE,...)

disthclustCBI(dmatrix,k,cut="number",method,noisecut=0,...)


noisemclustCBI(data,G,k,modelNames,nnk,hcmodel=NULL,Vinv=NULL,
                        summary.out=FALSE,...)

distnoisemclustCBI(dmatrix,G,k,modelNames,nnk,
                        hcmodel=NULL,Vinv=NULL,mdsmethod="classical",
                        mdsdim=4, summary.out=FALSE, points.out=FALSE,...)

claraCBI(data,k,usepam=TRUE,diss=inherits(data,"dist"),...)

pamkCBI(data,krange=2:10,k=NULL,criterion="asw", usepam=TRUE,
        scaling=FALSE,diss=inherits(data,"dist"),...)

tclustCBI(data,k,trim=0.05,...)

dbscanCBI(data,eps,MinPts,diss=inherits(data,"dist"),...)

mahalCBI(data,clustercut=0.5,...)

mergenormCBI(data, G=NULL, k=NULL, modelNames=NULL, nnk=0,
                         hcmodel = NULL,
                         Vinv = NULL, mergemethod="bhat",
                         cutoff=0.1,...)

speccCBI(data,k,...)

pdfclustCBI(data,...)


stupidkcentroidsCBI(dmatrix,k,distances=TRUE)

stupidknnCBI(dmatrix,k)

stupidkfnCBI(dmatrix,k)

stupidkavenCBI(dmatrix,k)

Arguments

`data`	a numeric matrix. The data matrix - usually a cases*variables-data matrix. `claraCBI`, `pamkCBI` and `dbscanCBI` work with an `n*n`-dissimilarity matrix as well, see parameter `diss`.
`dmatrix`	a square numerical dissimilarity matrix or a `dist`-object.
`k`	numeric, usually integer. In most cases, this is the number of clusters for methods where this is fixed. For `hclustCBI` and `disthclustCBI` see parameter `cut` below. Some methods have a `k` parameter in addition to a `G` or `krange` parameter for compatibility; in these cases `k` does not have to be specified, but if it is, it is always a single number of clusters and overrides `G` and `krange`.
`scaling`	either a logical value or a numeric vector of length equal to the number of variables. If `scaling` is a numeric vector with length equal to the number of variables, then each variable is divided by the corresponding value from `scaling`. If `scaling` is `TRUE` then scaling is done by dividing the (centered) variables by their root-mean-square, and if `scaling` is `FALSE`, no scaling is done before execution.
`runs`	integer. Number of random initializations from which the k-means algorithm is started.
`criterion`	`"ch"` or `"asw"`. Decides whether number of clusters is estimated by the Calinski-Harabasz criterion or by the average silhouette width.
`cut`	either "level" or "number". This determines how `cutree` is used to obtain a partition from a hierarchy tree. `cut="level"` means that the tree is cut at a particular dissimilarity level, `cut="number"` means that the tree is cut in order to obtain a fixed number of clusters. The parameter `k` specifies the number of clusters or the dissimilarity level, depending on `cut`.
`method`	method for hierarchical clustering, see the documentation of `hclust`.
`noisecut`	numeric. All clusters of size `<=noisecut` in the `disthclustCBI`/`hclustCBI`-partition are joined and declared as noise/outliers.
`minlevel`	integer. `minlevel=1` means that all clusters in the tree are given out by `hclusttreeCBI` or `disthclusttreeCBI`, including one-point clusters (but excluding the cluster with all points). `minlevel=2` excludes the one-point clusters. `minlevel=3` excludes the two-point cluster which has been merged first, and increasing the value of `minlevel` by 1 in all further steps means that the remaining earliest formed cluster is excluded.
`G`	vector of integers. Number of clusters or numbers of clusters used by `mclustBIC`. If `G` has more than one entry, the number of clusters is estimated by the BIC.
`modelNames`	vector of strings. Models for covariance matrices, see documentation of `mclustBIC`.
`nnk`	integer. Tuning constant for `NNclean`, which is used to estimate the initial noise for `noisemclustCBI` and `distnoisemclustCBI`. See parameter `k` in the documentation of `NNclean`. `nnk=0` means that no noise component is fitted.
`hcmodel`	string or `NULL`. Determines the initialization of the EM-algorithm for `mclustBIC`. Documented in `hc`.
`Vinv`	numeric. See documentation of `mclustBIC`.
`summary.out`	logical. If `TRUE`, the result of `summary.mclustBIC` is added as component `mclustsummary` to the output of `noisemclustCBI` and `distnoisemclustCBI`.
`mdsmethod`	"classical", "kruskal" or "sammon". Determines the multidimensional scaling method to compute Euclidean data from a dissimilarity matrix. See `cmdscale`, `isoMDS` and `sammon`.
`mdsdim`	integer. Dimensionality of MDS solution.
`points.out`	logical. If `TRUE`, the matrix of MDS points is added as component `points` to the output of `distnoisemclustCBI`.
`usepam`	logical. If `TRUE`, the function `pam` is used for clustering, otherwise `clara`. `pam` uses the full data set and usually gives the better solution; `clara` works on subsamples and is faster for large data sets.
`diss`	logical. If `TRUE`, `data` will be considered as a dissimilarity matrix. In `claraCBI`, this requires `usepam=TRUE`.
`krange`	vector of integers. Numbers of clusters to be compared.
`trim`	numeric between 0 and 1. Proportion of data points trimmed, i.e., assigned to noise. See `tclust` in the tclust package.
`eps`	numeric. The radius of the neighborhoods to be considered by `dbscan`.
`MinPts`	integer. How many points have to be in a neighborhood so that a point is considered to be a cluster seed? See documentation of `dbscan`.
`clustercut`	numeric between 0 and 1. If `fixmahal` is used for fuzzy clustering, a crisp partition is generated and points with cluster membership values above `clustercut` are considered as members of the corresponding cluster.
`mergemethod`	method for merging Gaussians, passed on as `method` to `mergenormals`.
`cutoff`	numeric between 0 and 1, tuning constant for `mergenormals`.
`distances`	logical (only for `stupidkcentroidsCBI`). If `FALSE`, `dmatrix` is interpreted as cases*variables data matrix.
`...`	further parameters to be transferred to the original clustering functions (not required).
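
The three forms of the scaling argument can be illustrated with base R alone. This is a sketch of what the description of scaling says, not the code used inside the interface functions; the divisors are arbitrary illustrative values.

```r
x <- as.matrix(iris[, 1:4])

# scaling = TRUE: centre each variable, then divide it by its root-mean-square.
# For centred columns scale() uses the n - 1 denominator, i.e. the standard deviation.
x_std <- scale(x, center = TRUE, scale = TRUE)

# scaling = numeric vector: divide each variable by the corresponding value.
divisors <- c(1, 2, 4, 8)   # hypothetical values, one per variable
x_div <- sweep(x, 2, divisors, "/")

# scaling = FALSE: the data are used unchanged.
x_raw <- x

# After standardisation every column has standard deviation 1.
apply(x_std, 2, sd)
```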

Details

All these functions call clustering methods implemented in R to cluster data and to provide output in the format required by clusterboot. Here is a brief overview. For further details see the help pages of the involved clustering methods.

kmeansCBI: an interface to the function kmeansruns calling kmeans for k-means clustering. (kmeansruns allows the specification of several random initializations of the k-means algorithm and estimation of k by the Calinski-Harabasz index or the average silhouette width.)
hclustCBI: an interface to the function hclust for agglomerative hierarchical clustering with noise component (see parameter noisecut above). This function produces a partition and assumes a cases*variables matrix as input.
hclusttreeCBI: an interface to the function hclust for agglomerative hierarchical clustering. This function gives out all clusters belonging to the hierarchy (upward from a certain level, see parameter minlevel above).
disthclustCBI: an interface to the function hclust for agglomerative hierarchical clustering with noise component (see parameter noisecut above). This function produces a partition and assumes a dissimilarity matrix as input.
noisemclustCBI: an interface to the function mclustBIC, for normal mixture model based clustering. Warning: mclustBIC often has problems with multiple points (several observations at exactly the same location). In clusterboot, it is recommended to use this together with multipleboot=FALSE.
distnoisemclustCBI: an interface to the function mclustBIC for normal mixture model based clustering. This assumes a dissimilarity matrix as input and generates a data matrix by multidimensional scaling first. Warning: mclustBIC often has problems with multiple points (several observations at exactly the same location). In clusterboot, it is recommended to use this together with multipleboot=FALSE.
claraCBI: an interface to the functions pam and clara for partitioning around medoids.
pamkCBI: an interface to the function pamk calling pam for partitioning around medoids. The number of clusters is estimated by the Calinski-Harabasz index or by the average silhouette width.
tclustCBI: an interface to the function tclust in the tclust package for trimmed Gaussian clustering. This assumes a cases*variables matrix as input.
dbscanCBI: an interface to the function dbscan for density based clustering.
mahalCBI: an interface to the function fixmahal for fixed point clustering. This assumes a cases*variables matrix as input.
mergenormCBI: an interface to the function mergenormals for clustering by merging Gaussian mixture components. Unlike mergenormals, mergenormCBI includes the computation of the initial Gaussian mixture. This assumes a cases*variables matrix as input.
speccCBI: an interface to the function specc for spectral clustering. See the specc help page for additional tuning parameters. This assumes a cases*variables matrix as input.
pdfclustCBI: an interface to the function pdfCluster for density-based clustering. See the pdfCluster help page for additional tuning parameters. This assumes a cases*variables matrix as input.
stupidkcentroidsCBI: an interface to the function stupidkcentroids for random centroid-based clustering. See the stupidkcentroids help page. This can have a distance matrix as well as a cases*variables matrix as input, see parameter distances.
stupidknnCBI: an interface to the function stupidknn for random nearest neighbour clustering. See the stupidknn help page. This assumes a distance matrix as input.
stupidkfnCBI: an interface to the function stupidkfn for random farthest neighbour clustering. See the stupidkfn help page. This assumes a distance matrix as input.
stupidkavenCBI: an interface to the function stupidkaven for random average dissimilarity clustering. See the stupidkaven help page. This assumes a distance matrix as input.
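
Because all interfaces return output in the same format, results from different methods can be compared directly. The sketch below calls three of the functions on the same data; the data set, the range 2:6 and the choice of three clusters are assumptions made only for illustration.

```r
library(fpc)

set.seed(1)
x <- as.matrix(iris[, 1:4])

# k-means, number of clusters chosen from 2:6 by the Calinski-Harabasz criterion.
km <- kmeansCBI(x, krange = 2:6, criterion = "ch", runs = 5)

# Partitioning around medoids, number of clusters chosen by average silhouette width.
pk <- pamkCBI(x, krange = 2:6, criterion = "asw")

# Average-linkage hierarchical clustering, tree cut into three clusters.
hc <- hclustCBI(x, k = 3, cut = "number", method = "average")

# Number of clusters found by each method, read off the partitions.
sapply(list(kmeans = km, pamk = pk, hclust = hc),
       function(res) length(unique(res$partition)))

# Cross-tabulate two of the partitions.
table(km$partition, hc$partition)
```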

Value

All interface functions return a list with the following components (there may be some more, see summary.out and points.out above):

`result`	clustering result, usually a list with the full output of the clustering method (the precise format doesn't matter); whatever you want to use later.
`nc`	number of clusters. If some points don't belong to any cluster, these are declared "noise". `nc` includes the "noise cluster", and there should be another component `nccl`, being the number of clusters not including the noise cluster.
`clusterlist`	a list with one logical vector of length `n` (the number of data points) for each cluster, indicating whether a point is a member of this cluster (`TRUE`) or not (`FALSE`). If a noise cluster is included, it should always be the last vector in this list.
`partition`	an integer vector of length `n`, partitioning the data. If the method produces a partition, it should be the clustering. This component is only used for plots, so you could do something like `rep(1,n)` for non-partitioning methods. If a noise cluster is included, `nc=nccl+1` and the noise cluster is cluster no. `nc`.
`clustermethod`	a string indicating the clustering method.
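
The components above are enough to write an interface function of one's own. The following is a minimal sketch using only base R; the function name completeCBI and the choice of complete linkage are hypothetical and not part of fpc.

```r
# Hypothetical user-written interface: complete-linkage clustering cut into k clusters.
completeCBI <- function(data, k, ...) {
  tree <- hclust(dist(data), method = "complete")
  partition <- as.integer(cutree(tree, k = k))
  nc <- max(partition)
  # One logical membership vector of length n per cluster.
  clusterlist <- lapply(seq_len(nc), function(j) partition == j)
  list(result = tree,
       nc = nc,
       clusterlist = clusterlist,
       partition = partition,
       clustermethod = "complete linkage (sketch)")
}

out <- completeCBI(as.matrix(iris[, 1:4]), k = 3)
str(out[c("nc", "partition", "clustermethod")])
```

A function of this shape should be usable as clustermethod in clusterboot in the same way as the functions documented here.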

The output of some of the functions has further components:

`nccl`	see `nc` above.
`nnk`	by `noisemclustCBI` and `distnoisemclustCBI`, see above.
`initnoise`	logical vector, indicating initially estimated noise by `NNclean`, called by `noisemclustCBI` and `distnoisemclustCBI`.
`noise`	logical. `TRUE` if points were classified as noise/outliers by `disthclustCBI`.
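
The noise conventions (small clusters joined into one noise cluster, numbered nc = nccl + 1 and placed last in clusterlist) can be traced on a toy partition. This base R sketch follows the description of noisecut and the components above; it is not the code used inside hclustCBI or disthclustCBI, and the labels are made up.

```r
# Raw partition with two small clusters (labels are illustrative).
raw <- c(1, 1, 1, 1, 2, 2, 2, 3, 4, 4)
noisecut <- 2

# Points in clusters of size <= noisecut are declared noise.
sizes <- table(raw)
is_noise <- raw %in% as.numeric(names(sizes)[sizes <= noisecut])

# Renumber the remaining clusters 1..nccl; the noise cluster gets number nc.
kept <- sort(unique(raw[!is_noise]))
nccl <- length(kept)
nc <- nccl + any(is_noise)
partition <- match(raw, kept)   # NA for noise points at this stage
partition[is_noise] <- nc

# The noise cluster is the last logical vector in the list.
clusterlist <- lapply(seq_len(nc), function(j) partition == j)

partition
# 1 1 1 1 2 2 2 3 3 3
```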

Author(s)

Christian Hennig christian.hennig@unibo.it https://www.unibo.it/sitoweb/christian.hennig/en/

Examples

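  library(fpc)  # rFace and the CBI functions; tclustCBI also needs the tclust package
  options(digits=3)
  set.seed(20000)
  # Simulated two-dimensional "face" data
  face <- rFace(50,dMoNo=2,dNoEy=0,p=2)
  # Density-based clustering versus average-linkage clustering of the distance matrix
  dbs <- dbscanCBI(face,eps=1.5,MinPts=4)
  dhc <- disthclustCBI(dist(face),method="average",k=1.5,noisecut=2)
  table(dbs$partition,dhc$partition)
  # Merged Gaussian mixture components versus trimmed clustering (10% of points trimmed)
  dm <- mergenormCBI(face,G=10,modelNames="EEE",nnk=2)
  dtc <- tclustCBI(face,6,trim=0.1,restr.fact=500)
  table(dm$partition,dtc$partition)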

[Package fpc version 2.2-12 Index]