SigTree {TBEST}

R Documentation

Perform statistical analysis of tightness for branches of a hierarchical cluster.

Description

Given data from which a hierarchical tree is grown, compute measures of tightness for each branch, sample from the null distribution of these measures in the randomized data, and compute the corresponding p-values.

Usage

SigTree(myinput,mystat=c("all","fldc","bldc","fldcc","slb"),
        mymethod="complete",mymetric="euclidean",rand.fun=NA,
        by.block=NA,distrib=c("vanilla","Rparallel"),Ptail=TRUE,
        tailmethod=c("ML","MOM"),njobs=1,seed=NA,
        Nperm=ifelse(Ptail,1000,1000*nrow(myinput)),
        metric.args=list(),rand.args=list())

Arguments

`myinput`	A matrix with rows corresponding to items to be clustered.
`mystat`	A character string specifying the measures of tightness to be computed and tested for significance. See Details for the definitions of these measures. If `"all"` is chosen, the first three measures, `"fldc"`, `"bldc"` and `"fldcc"`, and the corresponding p-values are computed. Otherwise, only the specified measure and its p-value are computed.
`mymethod`	A character string specifying the linkage method for hierarchical clustering, to be used by the `hclust` function. See `hclust` argument `method` for method options.
`mymetric`	A character string specifying the definition of dissimilarity (distance) among the data items. The options, in addition to those for the argument `method` of the `dist` function, are `"pearson"`, `"kendall"`, and `"spearman"`. If one of the latter three is chosen, the distances are computed as `as.dist(1 - cor(myinput))`, with the corresponding option for the `method` argument of the `cor` function. It can also be a character string naming a user-supplied dissimilarity (distance) function for `myinput`. See Details and Examples below for further explanation.
`rand.fun`	A character string specifying the permutation method to be applied to `myinput`. If NA (default), no permutation is performed. `"shuffle.column"` performs a random permutation independently within each column. With `"shuffle.block"`, a random permutation is performed independently within each block of columns, as specified by the `by.block` argument, and independently from the other blocks. It can also be a character string naming a user-supplied randomization function for `myinput`. See Details and Examples below for further explanation.
`by.block`	A vector of the same length as the column dimension of `myinput`, specifying the blocking of columns of `myinput`. It is used in conjunction with `rand.fun = "shuffle.block"` and is ignored otherwise.
`distrib`	One of `"vanilla"` or `"Rparallel"`, specifying the distributed computing option for the cluster assignment step. For `"vanilla"` (default) no distributed computing is performed. For `"Rparallel"` the `parallel` package of `R` core is used for multi-core processing.
`Ptail`	Logical. If `Ptail` is TRUE (default), the Generalized Pareto Distribution is used to approximate the tail of the null distribution for each of the chosen measures. Otherwise, empirical p-values are computed directly from the corresponding samples.
`tailmethod`	A character string, used only if `Ptail` is TRUE. For `"ML"` the parameters of the Generalized Pareto Distribution are estimated by likelihood maximization; for `"MOM"` they are estimated by the method of moments.
`njobs`	A single integer specifying the number of worker jobs to create for distributed computation when `distrib = "Rparallel"`; ignored otherwise.
`seed`	An optional single integer value, used to set the random number generator seed (see Details).
`Nperm`	A single integer specifying the size of the sample drawn from the null distribution. As shown under Usage, the default is 1000 if `Ptail` is TRUE and `1000*nrow(myinput)` otherwise.
`metric.args`	Additional arguments for the user-supplied dissimilarity (distance) function. See Details and Examples below for further explanation.
`rand.args`	Additional arguments for the user-supplied randomization function. See Details and Examples below for further explanation.
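
The difference between the two settings of `Ptail` can be sketched in base R. This is only an illustration under stated assumptions, not the TBEST implementation: the helper names `empirical_p` and `gpd_tail_p`, the use of the 250 largest null values as the tail, and the `(k+1)/(N+1)` form of the empirical estimate are all choices made here. The tail fit uses the standard method-of-moments estimator for the Generalized Pareto Distribution, in the spirit of `tailmethod="MOM"`.

```r
# Empirical p-value: fraction of null samples at least as large as the
# observed statistic (the +1 correction is an assumption of this sketch).
empirical_p <- function(obs, null) {
  (1 + sum(null >= obs)) / (1 + length(null))
}

# Tail approximation: fit a Generalized Pareto Distribution (GPD) to the
# exceedances over a threshold by the method of moments.
# n_exc (number of tail points) is a hypothetical tuning parameter.
gpd_tail_p <- function(obs, null, n_exc = 250) {
  srt <- sort(null, decreasing = TRUE)
  # Threshold halfway between the n_exc-th and (n_exc+1)-th largest values
  thr <- (srt[n_exc] + srt[n_exc + 1]) / 2
  if (obs <= thr) return(empirical_p(obs, null))  # not in the tail
  exc <- srt[seq_len(n_exc)] - thr
  m <- mean(exc)
  v <- var(exc)
  shape <- 0.5 * (1 - m^2 / v)       # method-of-moments shape estimate
  scale <- 0.5 * m * (m^2 / v + 1)   # method-of-moments scale estimate
  z <- 1 + shape * (obs - thr) / scale
  if (z <= 0) return(0)  # beyond the fitted upper endpoint (shape < 0)
  surv <- if (abs(shape) < 1e-8) exp(-(obs - thr) / scale) else z^(-1 / shape)
  (n_exc / length(null)) * surv
}

set.seed(1)
null <- rexp(1000)       # stand-in for a sampled null distribution
obs <- 9                 # an observed statistic far in the tail
empirical_p(obs, null)   # can never fall below 1/1001
gpd_tail_p(obs, null)    # can resolve smaller p-values from the same sample
```

The empirical estimate cannot resolve p-values below roughly `1/Nperm`, which is consistent with the much larger default `Nperm` when `Ptail` is FALSE.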

Details

When rand.fun is set to the name of a user-supplied randomization function, the first argument of that function should be set to myinput. See Examples below.
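
A minimal sketch of randomization functions that follow this calling convention is given here. `my_shuffle_column` and `my_shuffle_block` are hypothetical names; they mimic the behaviour described for `"shuffle.column"` and `"shuffle.block"` under Arguments but are not the package's own implementations. In particular, applying one common row permutation to all columns of a block is an assumption about what block shuffling means.

```r
# Permute each column independently; the first argument is the data matrix.
my_shuffle_column <- function(x) {
  out <- apply(x, 2, sample)
  dimnames(out) <- dimnames(x)
  out
}

# Permute rows jointly within each block of columns, independently across
# blocks. 'block' is an additional argument of the kind rand.args supplies.
my_shuffle_block <- function(x, block) {
  out <- x
  for (b in unique(block)) {
    cols <- which(block == b)
    out[, cols] <- x[sample(nrow(x)), cols]
  }
  out
}

set.seed(42)
x <- matrix(1:20, nrow = 5,
            dimnames = list(letters[1:5], paste0("c", 1:4)))
my_shuffle_column(x)                          # columns shuffled separately
my_shuffle_block(x, block = c(1, 1, 2, 2))    # columns 1-2 and 3-4 move together
```

Following the pattern in the Examples, such a function would presumably be passed by name, for instance `rand.fun="my_shuffle_block"` together with `rand.args=list(chrom)`.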

The measures of tightness are defined as follows. Denote a node in the tree by a, its sibling node by b, and their parent node by p. Let their respective heights be ha, hb, hp. Finally, let Sx mean that the measure S is computed for the node x. Then the definitions are

fldc:

Sa = (hp-ha)/hp

fldcc:

Sa = (hp-(ha-hb)/2)/ha

bldc:

Sp = (2*hp-ha-hb)/(2*hp)

slb:

Sp = 2*hp-ha-hb

The first three measures test the tightness of all internal nodes at the same time, while slb tests only the two-way split of the input data. The seed argument is optional. Setting the seed ensures reproducibility of sampling from the null distribution.
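
To make the definitions concrete, the following sketch computes fldc, bldc and slb from the `merge` and `height` components of an `hclust` object. It is an illustration only: the function name `tightness` is hypothetical, leaves are assumed to have height 0, and fldcc is left out. The slb column is filled in for every node for convenience, although only the value at the root describes the two-way split of the data.

```r
# Compute fldc, bldc and slb for every internal node of an hclust tree.
# Assumption of this sketch: leaves are assigned height 0.
tightness <- function(hc) {
  m <- hc$merge
  h <- hc$height
  n <- nrow(m)
  # Height of a child entry in 'merge': negative = leaf, positive = node
  child_h <- function(k) if (k < 0) 0 else h[k]
  ha <- vapply(m[, 1], child_h, numeric(1))
  hb <- vapply(m[, 2], child_h, numeric(1))
  # Row index of the parent of each internal node (NA for the root)
  parent <- rep(NA_integer_, n)
  for (p in seq_len(n)) {
    kids <- m[p, m[p, ] > 0]
    parent[kids] <- p
  }
  hp <- h[parent]
  data.frame(
    height = h,
    fldc = (hp - h) / hp,                # S_a: node relative to its parent
    bldc = (2 * h - ha - hb) / (2 * h),  # S_p: node relative to its children
    slb  = 2 * h - ha - hb               # S_p: meaningful at the root only
  )
}

hc <- hclust(dist(USArrests[1:8, ]), method = "complete")
tightness(hc)   # fldc is NA in the last row, since the root has no parent
```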

Value

If rand.fun is set to NA, the function returns a matrix whose rows correspond to the internal nodes of the tree and whose columns contain the tree structure as in the merge component of the class hclust; the height component of hclust; and columns tabulating the values of the measures of tightness specified by the mystat argument. If rand.fun is set to a specific randomization method, an object of class best is returned. See ?best for details.

Note

If mymetric or rand.fun is a customized function, make sure you have read and write permission for your working directory.

Author(s)

Guoli Sun, Alex Krasnitz

References

Theo A. Knijnenburg, Lodewyk F. A. Wessels, et al. (2009). Fewer permutations, more accurate P-values. Bioinformatics 25(12), i161-i168.

Examples

####Each column is a gene expression profile for a case of leukemia. 
####Each case belongs to one of three subtypes.
data(leukemia)
#output only statistic table
mytable<-SigTree(data.matrix(leukemia),mystat="all",
        mymethod="ward",mymetric="euclidean")
class(mytable)
## Not run: 
#use multicore processing to detect significant sub-clusters
mytable<-SigTree(data.matrix(leukemia),mystat="all",
	mymethod="ward",mymetric="euclidean",rand.fun="shuffle.column",
	distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
class(mytable)
####Each row after the 1st describes an item belonging to one of four subtypes. 
####Each column corresponds to a genomic location in one of 22 human chromosomes. 
####The 1st row contains the chromosome numbers.
data(T10)
#Perform randomization within each chromosome
chrom<-as.numeric(T10[1,])
mydata<-T10[-1,] 
mytable<-SigTree(data.matrix(mydata),mystat="fldc",        
	mymethod="ward",mymetric="euclidean",rand.fun="shuffle.block",
	by.block=chrom,distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="ML")
#Compute dissimilarity using a user-supplied distance function,
#and perform randomization using a user-supplied randomization function, 
#with additional arguments. 
#Both user-supplied functions are only useful as illustration.
mydist<-function(x,y){return(dist(x)/y)}
myrand<-function(x,z){return(apply(x+z,2,sample))}
mytable<-SigTree(data.matrix(leukemia),mystat="fldc",
	mymethod="ward",mymetric="mydist",rand.fun="myrand",
	distrib="Rparallel",njobs=2,Ptail=TRUE,tailmethod="MOM",
	metric.args=list(3),rand.args=list(2))

## End(Not run)

[Package TBEST version 5.2 Index]