pvclust {pvclust} | R Documentation |
Calculating P-values for Hierarchical Clustering
Description
Calculates p-values for hierarchical clustering via
multiscale bootstrap resampling. Hierarchical clustering is done for
given data and p-values are computed for each of the clusters.
Usage
pvclust(data, method.hclust="average",
method.dist="correlation", use.cor="pairwise.complete.obs",
nboot=1000, parallel=FALSE, r=seq(.5,1.4,by=.1),
store=FALSE, weight=FALSE, iseed=NULL, quiet=FALSE)
parPvclust(cl=NULL, data, method.hclust="average",
method.dist="correlation", use.cor="pairwise.complete.obs",
nboot=1000, r=seq(.5,1.4,by=.1), store=FALSE, weight=FALSE,
init.rand=NULL, iseed=NULL, quiet=FALSE)
Arguments
data |
numeric data matrix or data frame. |
method.hclust |
the agglomerative method used in hierarchical clustering. This
should be (an abbreviation of) one of |
method.dist |
the distance measure to be used. This should be
a character string, or a function which returns a |
use.cor |
character string which specifies the method for
computing correlation with data including missing values. This
should be (an abbreviation of) one of |
nboot |
the number of bootstrap replications. The default is
|
parallel |
switch for parallel computation.
If |
r |
numeric vector which specifies the relative sample sizes of
bootstrap replications. For original sample size |
store |
logical. If |
cl |
a cluster object created by package parallel or snow. If NULL, use the registered default cluster. |
weight |
logical. If |
init.rand |
logical. If |
iseed |
An integer. If non- |
quiet |
logical. If |
Details
Function pvclust conducts multiscale bootstrap resampling to calculate
p-values for each cluster in the result of hierarchical
clustering. parPvclust is the parallel version of this
procedure, which depends on package parallel for parallel computation.
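The resampling idea can be illustrated with a minimal base-R sketch. It is not the pvclust implementation: the toy data, the tracked cluster {a, b}, the helper has_ab and the bootstrap sample size round(n * r) are all assumptions made for illustration. The sketch only records how often a cluster reappears at each scale; the curve fitting that turns these frequencies into p-values is not shown.

```r
# Illustrative sketch only: base R, not the pvclust implementation.
set.seed(1)
n <- 50
x <- matrix(rnorm(n * 4), nrow = n,
            dimnames = list(NULL, c("a", "b", "c", "d")))
x[, "b"] <- x[, "a"] + rnorm(n, sd = 0.3)  # make a and b strongly correlated

r <- seq(0.5, 1.4, by = 0.1)  # relative sample sizes (the documented default)
nboot <- 50                   # kept small so the sketch runs quickly

# Hypothetical helper: TRUE if some merge step joins exactly the
# singletons a (column 1) and b (column 2).
has_ab <- function(hc) {
  any(apply(hc$merge, 1, function(m) setequal(m, c(-1, -2))))
}

# For each scale, resample observations (rows) with replacement,
# recluster the columns, and record how often {a, b} appears.
bp_ab <- sapply(r, function(ri) {
  size <- round(n * ri)  # assumed bootstrap sample size at this scale
  mean(replicate(nboot, {
    idx <- sample.int(n, size, replace = TRUE)
    hc <- hclust(as.dist(1 - cor(x[idx, ])), method = "average")
    has_ab(hc)
  }))
})

data.frame(r = r, freq_ab = bp_ab)
```

Each row of the resulting data frame pairs a relative sample size with the proportion of bootstrap trees that contained the cluster.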
For data expressed as an (n \times p) matrix or data frame, we
assume that the data is n observations of p objects, which
are to be clustered. The i'th row vector corresponds to the
i'th observation of these objects and the j'th column
vector corresponds to a sample of the j'th object with size n.
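Because the columns are the objects being clustered, the orientation of the input matters. The following sketch uses only base R (hclust and cor) on made-up data to show which dimension ends up as the leaves of the tree; the object and row names are hypothetical. Transposing the data with t() switches the roles of rows and columns.

```r
# Illustrative sketch with made-up data: 20 observations of 5 objects.
set.seed(42)
x <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5,
            dimnames = list(paste0("obs", 1:20), paste0("obj", 1:5)))

# Correlation-based dissimilarity between columns: the 5 objects are clustered.
hc_cols <- hclust(as.dist(1 - cor(x)), method = "average")
hc_cols$labels

# To cluster the rows instead, transpose first so that rows become columns.
hc_rows <- hclust(as.dist(1 - cor(t(x))), method = "average")
length(hc_rows$labels)
```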
There are several methods to measure the dissimilarities between
objects. For data matrix X=\{x_{ij}\}, the "correlation" method takes

1 - \frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j) (x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2} \sqrt{\sum_{i=1}^n (x_{ik} - \bar{x}_k)^2}}

for dissimilarity between the j'th and k'th object, where
\bar{x}_j = \frac{1}{n} \sum_{i=1}^n x_{ij} \mbox{ and } \bar{x}_k = \frac{1}{n} \sum_{i=1}^n x_{ik}.

"uncentered" takes uncentered sample correlation

1 - \frac{\sum_{i=1}^n x_{ij} x_{ik}}{\sqrt{\sum_{i=1}^n x_{ij}^2} \sqrt{\sum_{i=1}^n x_{ik}^2}}

and "abscor" takes the absolute value of sample correlation

1 - \Biggl| \frac{\sum_{i=1}^n (x_{ij} - \bar{x}_j) (x_{ik} - \bar{x}_k)}{\sqrt{\sum_{i=1}^n (x_{ij} - \bar{x}_j)^2} \sqrt{\sum_{i=1}^n (x_{ik} - \bar{x}_k)^2}} \Biggr|.
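The three formulas can be checked numerically. The sketch below evaluates them in base R on made-up data; the variable names (d_cor, d_unc, d_abs, manual) are hypothetical, and the code mirrors the formulas above rather than the package's internal code.

```r
# Illustrative sketch with made-up data: 30 observations of 4 objects.
set.seed(7)
x <- matrix(rnorm(30 * 4), nrow = 30, ncol = 4,
            dimnames = list(NULL, c("a", "b", "c", "d")))

# "correlation": one minus the sample correlation between columns
d_cor <- 1 - cor(x)

# "uncentered": same form, but without subtracting the column means
norms <- sqrt(colSums(x^2))
d_unc <- 1 - crossprod(x) / outer(norms, norms)

# "abscor": one minus the absolute value of the sample correlation
d_abs <- 1 - abs(cor(x))

# Evaluate the "correlation" formula term by term for j = 1, k = 2
xj <- x[, 1] - mean(x[, 1])
xk <- x[, 2] - mean(x[, 2])
manual <- 1 - sum(xj * xk) / (sqrt(sum(xj^2)) * sqrt(sum(xk^2)))
all.equal(manual, d_cor[1, 2])
```

Note the ranges: "correlation" and "uncentered" dissimilarities lie in [0, 2], while "abscor" lies in [0, 1] because strongly negative correlations are treated as similar.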
Value
hclust |
hierarchical clustering for original data generated by
function |
edges |
data frame object which contains |
count |
data frame object which contains primitive information about the result of multiscale bootstrap resampling. |
msfit |
list whose elements are results of curve fitting for
multiscale bootstrap resampling, of class |
nboot |
numeric vector of number of bootstrap replications. |
r |
numeric vector of the relative sample size for bootstrap replications. |
store |
list contains bootstrap replications if |
version |
|
Author(s)
Ryota Suzuki suzuki@ef-prime.com
References
Suzuki, R. and Shimodaira, H. (2006) "Pvclust: an R package for assessing the uncertainty in hierarchical clustering", Bioinformatics, 22 (12): 1540-1542.
Shimodaira, H. (2004) "Approximately unbiased tests of regions using multistep-multiscale bootstrap resampling", Annals of Statistics, 32, 2616-2641.
Shimodaira, H. (2002) "An approximately unbiased test of phylogenetic tree selection", Systematic Biology, 51, 492-508.
Suzuki, R. and Shimodaira, H. (2004) "An application of multiscale bootstrap resampling to hierarchical clustering of microarray data: How accurate are these clusters?", The Fifteenth International Conference on Genome Informatics 2004, P034.
http://www.sigmath.es.osaka-u.ac.jp/shimo-lab/prog/pvclust/
See Also
lines.pvclust, print.pvclust, msfit, plot.pvclust, text.pvclust, pvrect and pvpick.
Examples
### example using Boston data in package MASS
data(Boston, package = "MASS")
## multiscale bootstrap resampling (non-parallel)
boston.pv <- pvclust(Boston, nboot=100, parallel=FALSE)
## CAUTION: nboot=100 may be too small for actual use.
## We suggest nboot=1000 or larger.
## plot/print functions will be useful for diagnostics.
## plot dendrogram with p-values
plot(boston.pv)
ask.bak <- par()$ask
par(ask=TRUE)
## highlight clusters with high au p-values
pvrect(boston.pv)
## print the result of multiscale bootstrap resampling
print(boston.pv, digits=3)
## plot diagnostic for curve fitting
msplot(boston.pv, edges=c(2,4,6,7))
par(ask=ask.bak)
## print clusters with high p-values
boston.pp <- pvpick(boston.pv)
boston.pp
### Using a custom distance measure
## Define a distance function which returns an object of class "dist".
## The function must have only one argument "x" (data matrix or data.frame).
cosine <- function(x) {
    x <- as.matrix(x)
    y <- t(x) %*% x
    res <- 1 - y / (sqrt(diag(y)) %*% t(sqrt(diag(y))))
    res <- as.dist(res)
    attr(res, "method") <- "cosine"
    return(res)
}
result <- pvclust(Boston, method.dist=cosine, nboot=100)
plot(result)
## Not run:
### parallel computation
result.par <- pvclust(Boston, nboot=1000, parallel=TRUE)
plot(result.par)
## End(Not run)