ncut {NCutYX}R Documentation

Cluster the Columns of Y into K Groups Using the NCut Graph Measure.

Description

Builds a similarity matrix for the columns of Y and clusters them into K groups based on the NCut graph measure. Correlation, Euclidean and Gaussian distances can be used to construct the similarity matrix.

Usage

ncut(Y, K = 2, B = 30, N = 500, dist = "correlation", scale = TRUE,
  q = 0.1, sigma = 1)

Arguments

Y

is a n x p matrix of p variables and n observations. The p columns of Y will be clustered into K groups using NCut.

K

is the number of clusters.

B

is the number of iterations.

N

is the number of samples per iterations.

dist

is the type of distance metric for the construction of the similarity matrix. Options are 'gaussian', 'euclidean' and 'correlation', the latter being the default.

scale

equals TRUE if data Y is to be scaled with mean 0 and variance 1.

q

is the quantile used for the top results at each iterations.

sigma

is the bandwidth parameter when the dist metric chosen is 'gaussian' (default=0.1).

Details

The algorithm minimizes the NCut through the cross entropy method. The edges of the graph correspond to the entries of a similarity matrix constructed based on a correlation, euclidean or gaussian distance metric. The clusters correspond to partitions that minimize this NCut objective function.

Value

A list with the following components:

quantile

a vector of length N which contains the quantiles q at each iteration of the optimization algorithm.

cluster

a matrix representing the clustering result of dimension p times K, where p is the number of columns of Y.

ncut

the NCut measure for the cluster result.

Author(s)

Sebastian Jose Teran Hidalgo. Maintainer: Sebastian Jose Teran Hidalgo sebastianteranhidalgo@gmail.com.

References

Von Luxburg, Ulrike. "A tutorial on spectral clustering." Statistics and computing 17.4 (2007): 395-416.

Kroese, D. P., Rubinstein, R. Y., Cohen, I., Porotsky, S., & Taimre, T. (2013). "Cross-entropy method." In Encyclopedia of Operations Research and Management Science (pp. 326-333). Springer US.

Examples

# This sets up the initial parameters for the simulation.
library(MASS)
n=100 # Sample size
B=30 # Number of iterations in the simulated annealing algorithm.
p=50 # Number of columns of Y.

S=matrix(0.2,p,p)
S[1:(p/2),(p/2+1):p]=0
S[(p/2+1):p,1:(p/2)]=0
S=S-diag(diag(S))+diag(p)
mu=rep(0,p)

W0=matrix(1,p,p)
W0[1:(p/2),1:(p/2)]=0
W0[(p/2+1):p,(p/2+1):p]=0
Denum=sum(W0)

Y=mvrnorm(n, mu, S)
# NCut
Res=ncut(Y,
K=2,
B=30,
N=1000,
dist='correlation',
scale=TRUE,
q=0.2,
sigma=0.1)
Cx=Res[[2]]
f11=matrix(Cx[,1],p,1)
f12=matrix(Cx[,2],p,1)

errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum
# This is the true error of the clustering solution.
errorL

[Package NCutYX version 0.1.0 Index]