R: Optimal Multi-channel Weighted Univariate Clustering

MultiChannel.WUC {Ckmeans.1d.dp}

R Documentation

Optimal Multi-channel Weighted Univariate Clustering

Description

Perform optimal multi-channel weighted univariate k-means clustering in linear time.

Usage

  MultiChannel.WUC(x, y,  k=c(1,9))

Arguments

`x`	a numeric vector of data to be clustered. All `NA` elements must be removed from `x` before calling this function. The function will run faster on sorted `x` (in non-decreasing order) than an unsorted input.
`y`	a numeric matrix of non-negative weights for each element in `x`. Columns of the matrix are channels. It is highly recommended to use positive (instead of zero) weights to account for the influence of every element. Weights strongly influence clustering results. When the number of clusters `k` is given as a range, the weights should be linearly scaled to sum up to the observed sample size.
`k`	either an exact integer number of clusters, or a vector of length two specifying the minimum and maximum numbers of clusters to be examined. The default is `c(1,9)`. When `k` is a range, the actual number of clusters is determined by Bayesian information criterion (BIC).

Details

MultiChannel.WUC minimizes the total weighted within-cluster sum of squared distance (Zhong 2019). It uses the SMAWK algorithm (Aggarwal et al. 1987) with modified data structure to speed up the dynamic programming to linear runtime. The method selects an optimal k based on an approximate Gaussian mixture model using the BIC.

Value

A list object containing the following components:

`cluster`	a vector of clusters assigned to each element in `x`. Each cluster is indexed by an integer from 1 to `k`.
`centers`	a numeric vector of the (weighted) means for each cluster.
`withinss`	a numeric vector of the (weighted) within-cluster sum of squares for each cluster.
`size`	a vector of the (weighted) number of elements in each cluster.
`totss`	total sum of (weighted) squared distances between each element and the sample mean. This statistic is not dependent on the clustering result.
`tot.withinss`	total sum of (weighted) within-cluster squared distances between each element and its cluster mean. This statistic is minimized given the number of clusters.
`betweenss`	sum of (weighted) squared distances between each cluster mean and sample mean. This statistic is maximized given the number of clusters.
`xname`	a character string. The actual name of the `x` argument.
`yname`	a character string. The actual name of the `y` argument.

Author(s)

Hua Zhong and Mingzhou Song

References

Aggarwal A, Klawe MM, Moran S, Shor P, Wilber R (1987). “Geometric applications of a matrix-searching algorithm.” Algorithmica, 2(1-4), 195–208. doi:10.1007/BF01840359.

Zhong H (2019). Model-free Gene-to-zone Network Inference of Molecular Mechanisms in Biology. Ph.D. thesis, Department of Computer Science, New Mexico State University, Las Cruces, NM, USA.

Examples

  x <- sample(x = c(1:100), size = 20, replace = TRUE)
  Y <- matrix(sample(x = c(1:100), size = 40, replace = TRUE), ncol=2, nrow=length(x))

  res <- MultiChannel.WUC(x = x, y = Y, k = c(1:10))
  plot(res)

  n <- c(20, 20, 20)
  x <- c(rnorm(n[1], mean=-6),
         rnorm(n[2], mean=0),
         rnorm(n[3], mean=6))

  Y <- matrix(c(
    rep(c(1,0,0), times=n[1]),
    rep(c(0,1,0), times=n[2]),
    rep(c(0,0,1), times=n[3])
  ), byrow=TRUE, nrow=length(x))

  res <- MultiChannel.WUC(x = x, y = Y, k = 3)

  opar <- par(mar=c(3,3,2.5,1), mgp=c(1.5,0.5,0))
  plot(res)
  par(opar)

[Package Ckmeans.1d.dp version 4.3.5 Index]