sigclust {sigclust} | R Documentation |
Statistical Significance of Clustering
Description
Perform a significance analysis of clustering. SigClust studies whether clusters are really there, using the 2-means (k = 2) clustering index as a statistic. It assesses the significance of clustering by simulation from a single null Gaussian distribution. Null Gaussian parameters are estimated from the data.
Usage
sigclust(x, nsim, nrep=1, labflag=0, label=0, icovest=1)
Arguments
x |
A matrix or data.frame of expression data; each row corresponds to a sample and each column to a variable. Data should be properly normalized and must not contain missing values. |
nsim |
Number of simulated Gaussian samples to estimate the distribution of the clustering index for the main p-value computation. |
nrep |
Number of steps to use in 2-means clustering computations (default=1, chosen to optimize speed). This has no effect unless labflag=0. |
labflag |
An indicator variable specifying whether the p-value is for user-assigned clusters or for clusters found by 2-means; for user-assigned clusters labflag=1, otherwise labflag=0. |
label |
If labflag=0, SigClust uses labels generated by 2-means clustering. If labflag=1, label needs to be set as a numeric, integer vector of 1s and 2s with length |
icovest |
Covariance estimation type: 1. Use a soft threshold method as constrained MLE (default); 2. Use sample covariance estimate (recommended when diagnostics fail); 3. Use original background noise thresholded estimate from Liu et al. (2008) ("hard thresholding"). |
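The following is a minimal sketch of testing a user-assigned clustering with labflag=1. The toy data, the seed, and the choice nsim=100 are illustrative assumptions, and the label vector is assumed to carry one entry per sample (row of x). The call to sigclust is guarded so that the snippet still runs when the package is not installed.

```r
set.seed(2)

# Toy data (assumed for illustration): 2*n samples, p variables,
# with the first n samples shifted along the first variable.
n <- 15
p <- 50
dat <- matrix(rnorm(2 * n * p), 2 * n, p)
dat[1:n, 1] <- dat[1:n, 1] + 4

# User-assigned labels: a numeric vector of 1s and 2s,
# assumed here to have one entry per row of dat.
label <- c(rep(1, n), rep(2, n))

# Run SigClust on the supplied labels only if the package is available.
if (requireNamespace("sigclust", quietly = TRUE)) {
  res <- sigclust::sigclust(dat, nsim = 100, nrep = 1,
                            labflag = 1, label = label, icovest = 1)
}
```

With labflag=1 the nrep argument has no effect, since no 2-means clustering is run on the observed data.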
Details
The SigClust method addresses the problem of assessing
statistical significance of clustering as a testing procedure. The
null hypothesis of SigClust is that the data are from a single
Gaussian distribution. The significance of a given clustering is judged
by calculating an appropriate p-value. The SigClust method uses a test
statistic called the cluster index (CI) which is defined to be the sum
of within-class sums of squares about the mean divided by the total
sum of squares about the overall mean. The null distribution of the CI
can be approximated by simulating from a single Gaussian distribution,
fit to the data. Because CI is mean shift invariant, it is enough to
take the mean to be 0. Because CI is rotation invariant, we take the
covariance to be diagonal. There are three options for estimating the
eigenvalues of the covariance matrix: 1. Soft Thresholding
(recommended for high dimensions, when the diagnostics indicate
assumptions are met). 2. Sample eigenvalues (recommended for low
dimensions and when assumptions such as Gaussianity fail, though known
to be generally conservative). 3. Hard Thresholding.
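The procedure described above can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the helper names cluster_index and two_means_ci are hypothetical, the null uses the sample eigenvalues (analogous to option 2), and the empirical p-value is taken as the fraction of simulated CIs at or below the observed one, since a smaller CI indicates tighter clusters.

```r
set.seed(1)

# Cluster index (CI) for a given labelling (hypothetical helper):
# sum of within-class sums of squares about the class means,
# divided by the total sum of squares about the overall mean.
cluster_index <- function(x, label) {
  x <- as.matrix(x)
  total_ss <- sum(scale(x, center = TRUE, scale = FALSE)^2)
  within_ss <- 0
  for (k in unique(label)) {
    xk <- x[label == k, , drop = FALSE]
    within_ss <- within_ss + sum(scale(xk, center = TRUE, scale = FALSE)^2)
  }
  within_ss / total_ss
}

# CI of the 2-means solution, using stats::kmeans (hypothetical helper).
two_means_ci <- function(x, nstart = 5) {
  km <- kmeans(x, centers = 2, nstart = nstart)
  km$tot.withinss / km$totss
}

# Toy data (assumed): two groups separated along the first variable.
n <- 20
p <- 10
x <- matrix(rnorm(2 * n * p), 2 * n, p)
x[1:n, 1] <- x[1:n, 1] + 3
x[(n + 1):(2 * n), 1] <- x[(n + 1):(2 * n), 1] - 3
label <- rep(1:2, each = n)

# Invariance check: shift all samples by a constant vector, then
# rotate with a random orthogonal matrix. The CI should not change.
q <- qr.Q(qr(matrix(rnorm(p * p), p, p)))
x_moved <- sweep(x, 2, runif(p, -5, 5), "+") %*% q
ci_orig <- cluster_index(x, label)
ci_moved <- cluster_index(x_moved, label)

# Null simulation: mean 0, diagonal covariance whose entries are the
# sample eigenvalues of the data covariance.
eig <- eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values
eig <- pmax(eig, 0)  # guard against tiny negative values from rounding
nsim <- 200
ci_obs <- two_means_ci(x)
ci_null <- replicate(nsim, {
  z <- matrix(rnorm(nrow(x) * p), nrow(x), p)
  z <- sweep(z, 2, sqrt(eig), "*")  # scale column j by sqrt(eigenvalue j)
  two_means_ci(z)
})

# Empirical p-value: share of null CIs at or below the observed CI.
p_emp <- mean(ci_null <= ci_obs)
```

Because the invariance check uses fixed labels rather than a fresh 2-means run, ci_orig and ci_moved agree up to floating-point error, which is what justifies simulating from a zero-mean Gaussian with diagonal covariance.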
Value
The function returns an object of class sigclust. See help for sigclust-class for more details.
Author(s)
Hanwen Huang: hanwenh@email.unc.edu; Yufeng Liu: yfliu@email.unc.edu; J. S. Marron: marron@email.unc.edu
References
Liu, Yufeng, Hayes, David Neil, Nobel, Andrew and Marron, J. S. (2008). Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data. Journal of the American Statistical Association 103(483), 1281–1293.
See Also
Examples
## Simulate a dataset from a mixture of two multivariate
## Gaussian distributions with different means.
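library(sigclust)
mu <- 5    # mean shift applied to the first variable
n <- 30    # number of samples per group
p <- 500   # number of variables
dat <- matrix(rnorm(p*2*n),2*n,p)
## Shift the first variable in opposite directions for the two groups
dat[1:n,1] <- dat[1:n,1]+mu
dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu
nsim <- 1000
nrep <- 1
icovest <- 3   # hard thresholding
## Despite its name, pvalue holds the returned sigclust object
pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest)
## sigclust plot
plot(pvalue)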