ancut {NCutYX} | R Documentation |
Cluster the Columns of Y into K Groups with the Help of External Features X.
Description
This function clusters the columns of Y into K groups with the help of the external features in X.
Usage
ancut(Y, X, K = 2, B = 3000, L = 1000, alpha = 0.5, nlambdas = 100,
sampling = "equal", ncv = 5, dist = "correlation", sigma = 0.1)
Arguments
Y |
is an n x p matrix of p variables and n observations. The columns of Y will be clustered into K groups. |
X |
is an n x q matrix of q variables and n observations. |
K |
is the number of clusters. |
B |
is the number of iterations in the simulated annealing algorithm. |
L |
is the temperature coefficient in the simulated annealing algorithm. |
alpha |
is the coefficient of the elastic net penalty. |
nlambdas |
is the number of tuning parameters in the elastic net. |
sampling |
if 'equal', the sampling probabilities are the same throughout the simulated annealing algorithm; if 'size', the probabilities are proportional to the sizes of the clusters in the current iteration. |
ncv |
is the number of cross-validations in the elastic net. |
dist |
is the type of distance metric for the construction of the similarity matrix. Options are 'gaussian', 'euclidean' and 'correlation', the latter being the default. |
sigma |
is the parameter of the Gaussian kernel. It is ignored unless dist = 'gaussian'. |
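The two sampling options can be illustrated with a short base R sketch. It assumes the probabilities are defined over the K clusters and that the current clustering is stored as a p x K 0/1 indicator matrix; the helper cluster_sampling_probs is hypothetical and is not part of NCutYX, so the package's internal computation may differ.

```r
# Hypothetical helper, NOT part of NCutYX: sampling probabilities over the
# K clusters, given a p x K 0/1 indicator matrix of the current clustering.
cluster_sampling_probs <- function(cluster, sampling = c("equal", "size")) {
  sampling <- match.arg(sampling)
  K <- ncol(cluster)
  if (sampling == "equal") {
    # Every cluster is equally likely to be sampled.
    rep(1 / K, K)
  } else {
    # Probabilities proportional to the current cluster sizes.
    sizes <- colSums(cluster)
    sizes / sum(sizes)
  }
}

# Toy clustering of 4 variables: three in cluster 1, one in cluster 2.
cluster <- cbind(c(1, 1, 1, 0), c(0, 0, 0, 1))
probs_equal <- cluster_sampling_probs(cluster, "equal")  # 0.50 0.50
probs_size  <- cluster_sampling_probs(cluster, "size")   # 0.75 0.25
```

Under this reading, 'size' favors larger clusters while 'equal' treats all clusters alike regardless of how many variables they currently hold.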
Details
The algorithm minimizes a modified version of NCut through simulated annealing.
The modified NCut uses in the numerator the similarity matrix of the original data Y and the denominator uses the similarity matrix of the prediction of Y using X.
The clusters correspond to partitions that minimize this objective function.
The external information of X is incorporated by using elastic net to predict Y.
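The structure of such an objective can be sketched in base R. This is only an illustration under stated assumptions: the exact form used by ancut is not given here, so the sketch assumes that each cluster contributes its between-cluster similarity under the Y similarity matrix divided by its within-cluster similarity under the similarity matrix of the predicted Y. The names ancut_like_objective, block_similarity, W_y and W_hat are hypothetical, and the similarity matrices are small hand-built examples rather than output of an elastic net fit.

```r
# Hypothetical sketch, NOT the NCutYX implementation.
# W_y:     p x p similarity matrix of the columns of Y (numerator).
# W_hat:   p x p similarity matrix of the columns of the predicted Y (denominator).
# cluster: p x K 0/1 indicator matrix of cluster membership.
ancut_like_objective <- function(W_y, W_hat, cluster) {
  total <- 0
  for (k in seq_len(ncol(cluster))) {
    in_k <- cluster[, k] == 1
    cut_k   <- sum(W_y[in_k, !in_k])   # similarity leaving cluster k (from Y)
    assoc_k <- sum(W_hat[in_k, in_k])  # similarity inside cluster k (from predicted Y)
    total <- total + cut_k / assoc_k
  }
  total
}

# Toy 4 x 4 similarity matrix with two blocks: variables {1,2} and {3,4}.
block_similarity <- function(within, across) {
  W <- matrix(across, 4, 4)
  W[1:2, 1:2] <- within
  W[3:4, 3:4] <- within
  diag(W) <- 1
  W
}
W_y   <- block_similarity(within = 0.9, across = 0.1)
W_hat <- block_similarity(within = 0.8, across = 0.2)

good  <- cbind(c(1, 1, 0, 0), c(0, 0, 1, 1))  # matches the block structure
mixed <- cbind(c(1, 0, 1, 0), c(0, 1, 0, 1))  # splits both blocks

obj_good  <- ancut_like_objective(W_y, W_hat, good)
obj_mixed <- ancut_like_objective(W_y, W_hat, mixed)
obj_good < obj_mixed  # the partition that follows the blocks has the lower objective
```

A simulated annealing search would propose changes to the partition and keep those that tend to lower a quantity of this kind.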
Value
A list with the loss of the objective function, the clusters and the lambda penalty chosen through cross-validation. It has the following components:
- loss
a vector of length N which contains the loss at each iteration of the simulated annealing algorithm.
- cluster
a matrix representing the clustering result of dimension p times K, where p is the number of columns of Y.
- lambda.min
is the optimal lambda chosen through cross-validation for the elastic net for predicting Y with X.
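For downstream use it is often convenient to turn the p times K cluster matrix into a vector of cluster labels. The sketch below assumes that each row holds a 0/1 indicator of membership, as the column-wise use of the matrix in the Examples suggests; the matrix here is a hand-made stand-in, not actual output of ancut.

```r
# Stand-in for the 'cluster' component: 5 variables, K = 2 clusters.
# Assumption: each row has a single 1 marking the cluster of that variable.
cluster <- cbind(c(1, 1, 0, 0, 1), c(0, 0, 1, 1, 0))

# Column index of the largest entry in each row, i.e. the cluster label.
labels <- max.col(cluster, ties.method = "first")

# Number of variables assigned to each cluster.
sizes <- table(labels)
```

With real output, the same call would be applied to the cluster component of the returned list.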
Author(s)
Sebastian Jose Teran Hidalgo and Shuangge Ma. Maintainer: Sebastian Jose Teran Hidalgo. sebastianteranhidalgo@gmail.com.
References
Hidalgo, Sebastian J. Teran, Mengyun Wu, and Shuangge Ma. Assisted clustering of gene expression data using ANCut. BMC Genomics 18.1 (2017): 623.
Examples
#This sets up the initial parameters for the simulation.
library(MASS)#for mvrnorm
library(fields)
n=30 #Sample size
B=50 #Number of iterations in the simulated annealing algorithm.
L=10000 #Temperature coefficient.
p=50 #Number of columns of Y.
q=p #Number of columns of X.
h1=0.15
h2=0.25
S=matrix(0.2,q,q)
S[1:(q/2),(q/2+1):q]=0
S[(q/2+1):q,1:(q/2)]=0
S=S-diag(diag(S))+diag(q)
mu=rep(0,q)
W0=matrix(1,p,p)
W0[1:(p/2),1:(p/2)]=0
W0[(p/2+1):p,(p/2+1):p]=0
Denum=sum(W0)
B2=matrix(0,q,p)
for (i in 1:(p/2)){
B2[1:(q/2),i]=runif(q/2,h1,h2)
in1=sample.int(q/2,6)
B2[-in1,i]=0
}
for (i in (p/2+1):p){
B2[(q/2+1):q,i]=runif(q/2,h1,h2)
in2=sample(seq(q/2+1,q),6)
B2[-in2,i]=0
}
X=mvrnorm(n, mu, S)
Z=X%*%B2
Y=Z+matrix(rnorm(n*p,0,1),n,p)
#Our method
Res=ancut(Y=Y,X=X,B=B,L=L,alpha=0,ncv=3)
Cx=Res[[2]]
f11=matrix(Cx[,1],p,1)
f12=matrix(Cx[,2],p,1)
errorL=sum((f11%*%t(f11))*W0)/Denum+sum((f12%*%t(f12))*W0)/Denum
#This is the true error of the clustering solution.
errorL
par(mfrow=c(1,2))
#Below is a plot of the simulated annealing path.
plot(Res[[1]],type='l',ylab='')
#Cluster found by ANCut
image.plot(Cx)