sigclust {sigclust}    R Documentation

Statistical Significance of Clustering

Description

Perform a significance analysis of clustering. SigClust tests whether clusters are really present in the data, using the 2-means (k = 2) clustering index as the test statistic. It assesses the significance of clustering by simulation from a single null Gaussian distribution. The parameters of the null Gaussian are estimated from the data.

Usage

sigclust(x, nsim, nrep=1, labflag=0, label=0, icovest=1)

Arguments

x

A matrix or data.frame of expression data; each row corresponds to a sample and each column to a variable. Data should be properly normalized and must not contain missing values.

nsim

Number of simulated Gaussian samples to estimate the distribution of the clustering index for the main p-value computation.

nrep

Number of steps to use in 2-means clustering computations (default=1, chosen to optimize speed). This has no effect unless labflag=0.

labflag

An indicator variable specifying whether the p-value is computed for user-assigned cluster labels or for labels generated by 2-means clustering; for user-assigned clusters labflag=1, otherwise labflag=0.

label

If labflag=0, SigClust uses labels generated by 2-means clustering. If labflag=1, label needs to be set as a numeric, integer vector of 1s and 2s with length nrow(x) which indicates given cluster labels (grouping to be tested for significance).
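As an illustration of testing a user-assigned grouping, the following sketch passes a label vector with labflag=1. The simulated data and the grouping in lab are hypothetical and chosen only for the example; the call follows the usage signature shown above and assumes the sigclust package is installed.

```r
library(sigclust)

set.seed(1)
n <- 20
p <- 50

## Hypothetical data: two groups of n samples separated along the first variable.
x <- matrix(rnorm(2 * n * p), 2 * n, p)
x[1:n, 1] <- x[1:n, 1] + 5
x[(n + 1):(2 * n), 1] <- x[(n + 1):(2 * n), 1] - 5

## Hypothetical user-assigned grouping: first n rows in cluster 1, the rest in cluster 2.
lab <- rep(c(1, 2), each = n)

## The label vector must contain only 1s and 2s and have one entry per row of x.
stopifnot(length(lab) == nrow(x), all(lab %in% c(1, 2)))

## Test the given grouping instead of the 2-means labels.
res <- sigclust(x, nsim = 100, labflag = 1, label = lab, icovest = 1)
```

With labflag=1 the nrep argument plays no role, since no 2-means labels are computed for the observed data.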

icovest

Covariance estimation type: 1. Use a soft thresholding method as constrained MLE (default); 2. Use the sample covariance estimate (recommended when diagnostics fail); 3. Use the original background-noise thresholded estimate of Liu et al. (2008) ("hard thresholding").

Details

The SigClust method frames the assessment of statistical significance of clustering as a hypothesis testing procedure. The null hypothesis of SigClust is that the data are from a single Gaussian distribution. The significance of a given clustering is judged by calculating an appropriate p-value. The test statistic is the cluster index (CI), defined as the sum of within-class sums of squares about the class means divided by the total sum of squares about the overall mean.

The null distribution of the CI can be approximated by simulating from a single Gaussian distribution fit to the data. Because the CI is mean shift invariant, it is enough to take the mean to be 0. Because the CI is rotation invariant, the covariance can be taken to be diagonal, so only its eigenvalues need to be estimated. There are three options for estimating the eigenvalues of the covariance matrix:

1. Soft thresholding (recommended for high dimensions, when the diagnostics indicate that the assumptions are met).
2. Sample eigenvalues (recommended for low dimensions and when assumptions such as Gaussianity fail, but known to be generally conservative).
3. Hard thresholding.
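The procedure described above can be sketched in base R. This is a minimal illustration, not the package's implementation: cluster_index is a hypothetical helper, the null eigenvalues are simply the sample eigenvalues (analogous to option 2), and the p-value is the empirical proportion of simulated CIs that are at most the observed CI.

```r
## Hypothetical helper: cluster index of a 2-means split,
## i.e. within-cluster sum of squares divided by total sum of squares.
cluster_index <- function(x, nstart = 1) {
  km <- kmeans(x, centers = 2, nstart = nstart)
  km$tot.withinss / km$totss
}

set.seed(1)
n <- 20
p <- 10

## Illustrative data: two groups separated along the first variable.
x <- matrix(rnorm(2 * n * p), 2 * n, p)
x[1:n, 1] <- x[1:n, 1] + 5
x[(n + 1):(2 * n), 1] <- x[(n + 1):(2 * n), 1] - 5

ci_obs <- cluster_index(x)

## Null model: mean 0 and diagonal covariance, with the sample
## eigenvalues on the diagonal (an assumption made for this sketch).
eig <- pmax(eigen(cov(x), symmetric = TRUE, only.values = TRUE)$values, 0)

nsim <- 200
ci_null <- replicate(nsim, {
  ## Scale column j of a standard normal matrix by sqrt(eig[j]).
  z <- sweep(matrix(rnorm(nrow(x) * p), nrow(x), p), 2, sqrt(eig), `*`)
  cluster_index(z)
})

## A small CI indicates strong clustering, so the empirical p-value is
## the fraction of null CIs that are as small as or smaller than ci_obs.
p_emp <- mean(ci_null <= ci_obs)
```

The sketch uses only sample eigenvalues because they need no extra estimation step; the thresholding options modify the eigenvalues before simulation and are not reproduced here.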

Value

The function returns an object of class sigclust. See help for sigclust-class for more details.
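A short sketch of inspecting the returned object follows. It assumes the object is an S4 instance and that a slot named pval holds the simulation-based p-value; both are assumptions made for illustration, so the code lists the available slots first and reads pval only if it exists. The help for sigclust-class remains the authoritative description.

```r
library(sigclust)

set.seed(1)
## Hypothetical data with no cluster structure: 40 samples, 50 variables.
x <- matrix(rnorm(40 * 50), 40, 50)

res <- sigclust(x, nsim = 100)

## Confirm the class and list whatever slots the object actually has.
cls <- class(res)
slots <- methods::slotNames(res)
print(cls)
print(slots)

## Assumption: a slot called "pval" stores the empirical p-value.
## Read it only if it is present.
if ("pval" %in% slots) {
  print(methods::slot(res, "pval"))
}
```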

Author(s)

Hanwen Huang: hanwenh@email.unc.edu; Yufeng Liu: yfliu@email.unc.edu; J. S. Marron: marron@email.unc.edu

References

Liu, Yufeng, Hayes, David Neil, Nobel, Andrew and Marron, J. S. (2008). Statistical Significance of Clustering for High-Dimension, Low-Sample Size Data. Journal of the American Statistical Association 103(483), 1281–1293.

See Also

plot-methods.

Examples

## Simulate a dataset from a mixture of two multivariate
## Gaussian distributions with different means.

mu <- 5
n <- 30
p <- 500
dat <- matrix(rnorm(p*2*n),2*n,p)
dat[1:n,1] <- dat[1:n,1]+mu
dat[(n+1):(2*n),1] <- dat[(n+1):(2*n),1]-mu

nsim <- 1000
nrep <- 1
icovest <- 3
pvalue <- sigclust(dat,nsim=nsim,nrep=nrep,labflag=0,icovest=icovest)
## Plot the sigclust result
plot(pvalue)


[Package sigclust version 1.1.0.1 Index]