ancut {NCutYX}

R Documentation

Cluster the Columns of Y into K Groups with the Help of External Features X.

Description

This function clusters the columns of Y into K groups, using the external features X to assist the clustering.

Usage

ancut(Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, nlambdas = 100,
  sampling = "equal", ncv = 5, dist = "correlation", sigma = 0.1)

Arguments

`Y`	is an n x p matrix of p variables and n observations. The columns of Y will be clustered into K groups.
`X`	is an n x q matrix of q variables and n observations.
`K`	is the number of clusters.
`B`	is the number of iterations in the simulated annealing algorithm.
`L`	is the temperature coefficient in the simulated annealing algorithm.
`alpha`	is the coefficient of the elastic net penalty.
`nlambdas`	is the number of tuning parameters in the elastic net.
`sampling`	if 'equal', the sampling probabilities are the same for all clusters during the simulated annealing algorithm; if 'size', the probabilities are proportional to the sizes of the clusters in the current iteration.
`ncv`	is the number of cross-validations in the elastic net.
`dist`	is the type of distance metric for the construction of the similarity matrix. Options are 'gaussian', 'euclidean' and 'correlation', the latter being the default.
`sigma`	is the parameter of the Gaussian kernel. It is ignored unless 'gaussian' is chosen as the distance measure.
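
The `dist` and `sigma` arguments control how a p x p similarity matrix between the columns of Y is built. The sketch below illustrates one plausible reading of the three options in base R. The function name `similarity_sketch` and the exact formulas (absolute correlation, maximum distance minus distance, and a Gaussian kernel with bandwidth `sigma`) are assumptions made for illustration; they are not taken from the package source.

```r
# Hypothetical helper, not part of NCutYX: builds a p x p similarity matrix
# between the columns of Y under three assumed definitions.
similarity_sketch <- function(Y,
                              dist = c("correlation", "euclidean", "gaussian"),
                              sigma = 0.1) {
  dist <- match.arg(dist)
  # Pairwise Euclidean distances between the columns of Y.
  D <- as.matrix(stats::dist(t(Y)))
  switch(dist,
    # Assumed: absolute Pearson correlation between columns.
    correlation = abs(stats::cor(Y)),
    # Assumed: turn distances into similarities by subtracting from the maximum.
    euclidean = max(D) - D,
    # Assumed: Gaussian kernel with bandwidth sigma.
    gaussian = exp(-D^2 / (2 * sigma^2))
  )
}

set.seed(1)
Ytoy <- matrix(rnorm(20 * 6), 20, 6)
W_cor <- similarity_sketch(Ytoy, "correlation")
W_euc <- similarity_sketch(Ytoy, "euclidean")
W_gau <- similarity_sketch(Ytoy, "gaussian", sigma = 5)
```

Under these assumed definitions every matrix is symmetric, and `sigma` only affects the 'gaussian' option, which matches the description of `sigma` above.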

Details

The algorithm minimizes a modified version of the normalized cut (NCut) criterion through simulated annealing. In the modified NCut, the numerator is computed from the similarity matrix of the original data Y, and the denominator is computed from the similarity matrix of the prediction of Y from X. The returned clusters correspond to the partition that minimizes this objective function. The external information in X is incorporated by using the elastic net to predict Y.
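
To make the idea concrete, here is a minimal sketch of a loss of this shape and a simulated annealing loop around it. It is an illustration under stated assumptions, not the package's implementation: the names `ancut_loss_sketch` and `anneal_sketch`, the exact form of the denominator, the single-column move, and the cooling schedule `exp(-L * log(b + 1) * increase)` are all hypothetical. `Wy` stands for the similarity matrix of Y and `Wx` for the similarity matrix of the prediction of Y from X.

```r
# Hypothetical loss: for each cluster, similarity in Wy to the other clusters
# divided by the (assumed) total similarity in Wx of the cluster's columns.
ancut_loss_sketch <- function(C, Wy, Wx) {
  loss <- 0
  for (k in seq_len(ncol(C))) {
    inside <- C[, k] == 1
    cut_y <- sum(Wy[inside, !inside, drop = FALSE])  # numerator, from Y
    vol_x <- sum(Wx[inside, , drop = FALSE])         # denominator, from the prediction
    loss <- loss + cut_y / vol_x
  }
  loss
}

# Hypothetical annealing loop: move one column at a time, never emptying a cluster.
anneal_sketch <- function(Wy, Wx, K = 2, B = 200, L = 1000) {
  p <- nrow(Wy)
  to_matrix <- function(lab) outer(lab, seq_len(K), "==") * 1
  labels <- rep_len(seq_len(K), p)[sample.int(p)]  # random start, needs p >= K
  current <- ancut_loss_sketch(to_matrix(labels), Wy, Wx)
  loss <- numeric(B)
  for (b in seq_len(B)) {
    proposal <- labels
    j <- sample.int(p, 1)
    if (sum(labels == labels[j]) > 1) {
      others <- setdiff(seq_len(K), labels[j])
      proposal[j] <- others[sample.int(length(others), 1)]
    }
    candidate <- ancut_loss_sketch(to_matrix(proposal), Wy, Wx)
    # Assumed cooling schedule: worse moves become less likely as b grows.
    accept_prob <- exp(-L * log(b + 1) * (candidate - current))
    if (candidate <= current || runif(1) < accept_prob) {
      labels <- proposal
      current <- candidate
    }
    loss[b] <- current
  }
  list(loss = loss, cluster = to_matrix(labels))
}

set.seed(1)
Ytoy <- matrix(rnorm(30 * 8), 30, 8)
Wtoy <- abs(cor(Ytoy))  # used for both Wy and Wx in this toy run
res_sketch <- anneal_sketch(Wy = Wtoy, Wx = Wtoy, K = 2, B = 100, L = 1000)
```

In this sketch a larger `L` makes the acceptance of worse partitions rarer, so the search behaves more like a greedy descent, and `B` bounds the number of proposals.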

Value

A list with the following components:

loss: a vector which contains the loss at each iteration of the simulated annealing algorithm.
cluster: a matrix of dimension p times K representing the clustering result, where p is the number of columns of Y.
lambda.min: the optimal lambda chosen through cross-validation for the elastic net that predicts Y with X.
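
A p times K cluster matrix can be turned into a vector of cluster labels, which is often easier to tabulate or to compare with other clusterings. The sketch below assumes that the matrix is a 0/1 indicator with exactly one 1 per row; the toy matrix `Cx_toy` is made up for illustration and is not output of `ancut`.

```r
# Toy 5 x 2 indicator matrix standing in for the cluster component.
Cx_toy <- rbind(c(1, 0),
                c(1, 0),
                c(0, 1),
                c(1, 0),
                c(0, 1))

# For each row, the index of the column holding the 1 is the cluster label.
labels_toy <- max.col(Cx_toy, ties.method = "first")

# Cluster sizes.
sizes_toy <- table(labels_toy)
```

Here `labels_toy` is `1 1 2 1 2`, so the first cluster has three members and the second has two.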

Author(s)

Sebastian Jose Teran Hidalgo and Shuangge Ma. Maintainer: Sebastian Jose Teran Hidalgo. sebastianteranhidalgo@gmail.com.

References

Hidalgo, Sebastian J. Teran, Mengyun Wu, and Shuangge Ma. Assisted clustering of gene expression data using ANCut. BMC Genomics 18.1 (2017): 623.

Examples

#This sets up the initial parameters for the simulation.
library(NCutYX)#for ancut
library(MASS)#for mvrnorm
library(fields)#for image.plot
n=30 #Sample size
B=50 #Number of iterations in the simulated annealing algorithm.
L=10000 #Temperature coefficient.
p=50 #Number of columns of Y.
q=p #Number of columns of X.
h1=0.15 #Lower bound of the nonzero coefficients in B2.
h2=0.25 #Upper bound of the nonzero coefficients in B2.

#Covariance of X: two independent blocks with off-diagonal entries 0.2 and unit diagonal.
S=matrix(0.2,q,q)
S[1:(q/2),(q/2+1):q]=0
S[(q/2+1):q,1:(q/2)]=0
S=S-diag(diag(S))+diag(q)

mu=rep(0,q)

#W0 marks pairs of columns of Y that belong to different true groups.
W0=matrix(1,p,p)
W0[1:(p/2),1:(p/2)]=0
W0[(p/2+1):p,(p/2+1):p]=0
Denum=sum(W0)

#Coefficient matrix: each column of Y depends on 6 columns of X from its own block.
B2=matrix(0,q,p)
for (i in 1:(p/2)){
   B2[1:(q/2),i]=runif(q/2,h1,h2)
   in1=sample.int(q/2,6)
   B2[-in1,i]=0
}

for (i in (p/2+1):p){
   B2[(q/2+1):q,i]=runif(q/2,h1,h2)
   in2=sample(seq(q/2+1,q),6)
   B2[-in2,i]=0
}

X=mvrnorm(n, mu, S)
Z=X%*%B2
Y=Z+matrix(rnorm(n*p,0,1),n,p)
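#Our method: cluster the columns of Y with ANCut.
Res=ancut(Y=Y,X=X,B=B,L=L,alpha=0,ncv=3)
Cx=Res[[2]] #Cluster matrix of dimension p times K.
f11=matrix(Cx[,1],p,1) #Column of the first cluster.
f12=matrix(Cx[,2],p,1) #Column of the second cluster.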

errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum
#This is the true error of the clustering solution.
errorL

par(mfrow=c(1,2))
#Below is a plot of the simulated annealing path.
plot(Res[[1]],type='l',ylab='')
#Cluster found by ANCut
image.plot(Cx)
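
The quantity `errorL` in the example above can be checked on a case small enough to verify by hand. The sketch below wraps the same computation in a helper, `clustering_error_sketch`, which is a hypothetical name introduced here and not a function of NCutYX, and it assumes that the cluster matrix is a 0/1 indicator. With W0 marking pairs of columns from different true groups, the result is the share of those pairs that end up in the same cluster.

```r
# Hypothetical helper reproducing the errorL computation for any number of clusters.
clustering_error_sketch <- function(Cx, W0) {
  total <- 0
  for (k in seq_len(ncol(Cx))) {
    f <- matrix(Cx[, k], ncol = 1)
    # f %*% t(f) marks pairs placed together in cluster k;
    # multiplying by W0 keeps only pairs from different true groups.
    total <- total + sum((f %*% t(f)) * W0)
  }
  total / sum(W0)
}

# Four columns with true groups {1, 2} and {3, 4}.
p_toy <- 4
W0_toy <- matrix(1, p_toy, p_toy)
W0_toy[1:2, 1:2] <- 0
W0_toy[3:4, 3:4] <- 0

perfect <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))  # recovers the true groups
mixed <- cbind(c(1, 0, 1, 0), c(0, 1, 0, 1))    # splits every true group

err_perfect <- clustering_error_sketch(perfect, W0_toy)  # 0
err_mixed <- clustering_error_sketch(mixed, W0_toy)      # 0.5
```

A value of 0 means that no pair of columns from different true groups shares a cluster, so smaller values of `errorL` indicate a clustering closer to the simulated truth.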

[Package NCutYX version 0.1.0 Index]