SigTree {TBEST} | R Documentation |
Perform statistical analysis of tightness for branches of a hierarchical cluster.
Description
Given data from which a hierarchical tree is grown, compute measures of tightness for each branch, sample from the null distribution of these measures in randomized data, and compute the corresponding p-values.
Usage
SigTree(myinput,mystat=c("all","fldc","bldc","fldcc","slb"),
mymethod="complete",mymetric="euclidean",rand.fun=NA,
by.block=NA,distrib=c("vanilla","Rparallel"),Ptail=TRUE,
tailmethod=c("ML","MOM"),njobs=1,seed=NA,
Nperm=ifelse(Ptail,1000,1000*nrow(myinput)),
metric.args=list(),rand.args=list())
Arguments
myinput |
A matrix with rows corresponding to items to be clustered. |
mystat |
A character string specifying the measures of tightness to be computed and evaluated for significance of finding. See Details for the definitions of these measures. If |
mymethod |
A character string specifying the linkage method for hierarchical clustering, to be used by the |
mymetric |
A character string specifying the definition of dissimilarity (distance) among the data items. The options, in addition to those for the argument |
rand.fun |
A character string specifying the permutation method to be applied to |
by.block |
A vector of the same length as the column dimension of |
distrib |
One of |
Ptail |
Logical. If |
tailmethod |
A character string only needed to be specified if the |
njobs |
A single integer specifying the number of worker jobs to create in case of distributed computation if |
seed |
An optional single integer value, to be used to set the random number generator seed (see |
Nperm |
A single integer specifying the size of a sample from the null distribution. See |
metric.args |
Additional arguments for user-supplied dissimilarity (distance) function. See |
rand.args |
Additional arguments for user-supplied randomization function. See |
Details
When rand.fun is set to the name of a user-supplied randomization function, the first argument of that function should be myinput. See the examples below.
The measures of tightness are defined as follows. Denote a node in the tree by a, its sibling node by b, and their parent node by p. Let their respective heights be ha, hb, hp. Finally, let Sx mean that the measure S is computed for the node x. Then the definitions are:
fldc:
Sa = (hp-ha)/hp
fldcc:
Sa = (hp-(ha-hb)/2)/ha
bldc:
Sp = (2*hp-ha-hb)/(2*hp)
slb:
Sp = 2*hp-ha-hb
The first three measures test the tightness of all internal nodes at the same time, while slb only tests the two-way split of the input data.
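The fldc, bldc and slb formulas above can be evaluated by hand on an ordinary hclust object. The following is a minimal sketch in base R, not the package's own implementation. It assumes that a leaf (a single item) has height 0, and it uses a toy random matrix. The helper child_height and the variable names are hypothetical. Entries of fldc for leaf children are included only for illustration.

```r
# Sketch only: base R, not TBEST's internal code.
# Assumption: a leaf (single item) has height 0.
set.seed(1)
x <- matrix(rnorm(40), nrow = 8)           # 8 toy items, 5 features
hc <- hclust(dist(x), method = "complete")

# Height of a child referenced in hc$merge:
# negative entries are leaves, positive entries index earlier merges.
child_height <- function(k) if (k < 0) 0 else hc$height[k]

hp <- hc$height                            # parent heights, one per internal node
ha <- sapply(hc$merge[, 1], child_height)  # height of the first child
hb <- sapply(hc$merge[, 2], child_height)  # height of the second child

# Parent-level measures, following the definitions above
bldc <- (2 * hp - ha - hb) / (2 * hp)
slb  <- 2 * hp - ha - hb

# fldc is defined per child node; stored here per (parent, child) pair
fldc_a <- (hp - ha) / hp
fldc_b <- (hp - hb) / hp

print(cbind(hc$merge, height = hp, bldc = bldc, slb = slb))
```

With these definitions, bldc for a parent equals the mean of the fldc values of its two children, which gives a quick consistency check on any hand computation.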
The seed argument is optional. Setting the seed ensures reproducibility of sampling from the null distribution.
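The effect of fixing a seed can be illustrated in base R with a stand-alone randomization function. This is a sketch under assumptions: my_shuffle is a hypothetical function that takes the input matrix as its first argument and permutes each column independently, and set.seed is called directly here. How SigTree applies seed internally, especially with distributed computation, is not shown.

```r
# Hypothetical randomization function: the first argument is the input matrix.
# It permutes the entries of each column independently.
my_shuffle <- function(myinput) apply(myinput, 2, sample)

x <- matrix(1:20, nrow = 5)

# Resetting the same seed before each call reproduces the same permutation
set.seed(42); perm1 <- my_shuffle(x)
set.seed(42); perm2 <- my_shuffle(x)

identical(perm1, perm2)            # same seed, same permutation
all(colSums(perm1) == colSums(x))  # each column keeps its own values
```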
Value
If rand.fun is set to NA, the function returns a matrix whose rows correspond to the internal nodes of the tree and whose columns contain the tree structure, as in the merge component of class hclust; the height component of hclust; and columns tabulating the values of the measures of tightness specified by the mystat argument.
If rand.fun is set to a specific randomization method, an object of class best is returned. See ?best for details.
Note
If mymetric or rand.fun is a customized function, make sure you have read and write permission for your working directory.
Author(s)
Guoli Sun, Alex Krasnitz
References
Theo A. Knijnenburg, Lodewyk F. A. Wessels et al. (2009). Fewer permutations, more accurate P-values.
See Also
Examples
####Each column is a gene expression profile for a case of leukemia.
####Each case belongs to one of three subtypes.
data(leukemia)
#output only statistic table
mytable<-SigTree(data.matrix(leukemia),mystat="all",
mymethod="ward",mymetric="euclidean")
class(mytable)
## Not run:
#use multicore processing to detect significant sub-clusters
mytable<-SigTree(data.matrix(leukemia),mystat="all",
mymethod="ward",mymetric="euclidean",rand.fun="shuffle.column",
distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
class(mytable)
####Each row after the 1st describes an item belonging to one of four subtypes.
####Each column corresponds to a genomic location in one of 22 human chromosomes.
####The 1st row contains the chromosome numbers.
data(T10)
#Perform randomization within each chromosome
chrom<-as.numeric(T10[1,])
mydata<-T10[-1,]
mytable<-SigTree(data.matrix(mydata),mystat="fldc",
mymethod="ward",mymetric="euclidean",rand.fun="shuffle.block",
by.block=chrom,distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
#Compute dissimilarity using a user-supplied distance function,
#and perform randomization using a user-supplied randomization function,
#with additional arguments.
#Both user-supplied functions are only useful as illustration.
mydist<-function(x,y){return(dist(x)/y)}
myrand<-function(x,z){return(apply(x+z,2,sample))}
mytable<-SigTree(data.matrix(leukemia),mystat="fldc",
mymethod="ward",mymetric="mydist",rand.fun="myrand",
distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="MOM",metric.args=list(3),
rand.args=list(2))
## End(Not run)