runClue {ClueR} | R Documentation |
Run CLUster Evaluation
Description
Takes a time-course matrix, clusters it with the cmeans or kmeans algorithm, and tests the resulting clustering for enrichment against a reference annotation.
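As a rough illustration of what testing a clustering for enrichment against an annotation can look like, the sketch below runs a one-sided Fisher's exact test for one cluster and one annotation group using base R only. This page does not describe the statistic that runClue computes internally, so the choice of Fisher's exact test is an assumption, and every object name here (universe, cluster, group) is hypothetical toy data.

```r
# Hypothetical toy data: 40 sites, a 2-cluster partition, one annotation group.
universe <- paste("p", 1:40, sep = "_")
cluster  <- setNames(rep(c(1, 2), each = 20), universe)
group    <- paste("p", c(1:8, 25, 30), sep = "_")  # 8 of 10 members fall in cluster 1

# 2x2 contingency table: membership of cluster 1 versus membership of the group.
inCluster <- cluster == 1
inGroup   <- universe %in% group
tab <- table(inCluster = inCluster, inGroup = inGroup)

# One-sided test for over-representation of the group in cluster 1
# (assumed test; not necessarily what runClue uses).
p <- fisher.test(tab, alternative = "greater")$p.value
print(tab)
print(p)
```

A small p-value here indicates that the group's members are concentrated in the cluster more than expected by chance.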
Usage
runClue(
Tc,
annotation,
rep = 5,
kRange = 2:10,
clustAlg = "cmeans",
effectiveSize = c(5, 100),
pvalueCutoff = 0.05,
alpha = 0.5,
standardise = TRUE,
universe = NULL
)
Arguments
Tc |
a numeric matrix to be clustered. Columns correspond to time points and rows correspond to phosphorylation sites. |
annotation |
a list whose names correspond to kinases and whose elements contain the substrates that belong to each kinase. |
rep |
number of times the clustering is to be applied. This is to account for variability in the clustering algorithm. Default is 5. |
kRange |
the range of k (number of clusters) to be tested. Default is 2:10. |
clustAlg |
the clustering algorithm to be used, either cmeans or kmeans. The default is "cmeans". |
effectiveSize |
the size range of annotation groups to be considered when calculating enrichment. Groups that are too small or too large are excluded from the calculation of the overall enrichment of the clustering. Default is c(5, 100). |
pvalueCutoff |
a p-value cutoff for determining which kinase-substrate groups are included when calculating the overall enrichment of the clustering. Default is 0.05. |
alpha |
a regularisation factor for penalising a large number of clusters. Default is 0.5. |
standardise |
whether to z-score standardise the input matrix. Default is TRUE. |
universe |
the universe of genes, proteins, phosphosites, etc. that the enrichment is calculated against. The default is the row names of the dataset. |
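The following base R sketch illustrates how the annotation, universe, effectiveSize, and standardise arguments could be interpreted. It is not ClueR's implementation: it assumes that group sizes are counted after intersecting with the universe, that the effectiveSize bounds are inclusive, and that standardisation is applied per row. All object names and values are hypothetical.

```r
# Hypothetical toy inputs for illustration; these objects do not ship with ClueR.
set.seed(1)
Tc <- matrix(rnorm(40 * 4), nrow = 40,
             dimnames = list(paste("p", 1:40, sep = "_"), paste0("t", 1:4)))

# Annotation: a named list with one character vector of row identifiers per group.
annotation <- list(
  KS_1 = paste("p", 1:10, sep = "_"),              # 10 members
  KS_2 = paste("p", 11:13, sep = "_"),             # 3 members, below the lower bound
  KS_3 = c(paste("p", 21:28, sep = "_"), "p_999")  # "p_999" is absent from Tc
)

# With universe = NULL the row names of the dataset are used.
universe <- rownames(Tc)

# Count the members of each group that are present in the universe (assumption).
groupSizes <- sapply(annotation, function(ids) sum(ids %in% universe))

# Keep groups whose size lies within effectiveSize (bounds assumed inclusive).
effectiveSize <- c(5, 100)
kept <- names(groupSizes)[groupSizes >= effectiveSize[1] &
                          groupSizes <= effectiveSize[2]]
print(kept)

# z-score each row (profile): subtract the row mean, divide by the row sd.
TcStd <- t(scale(t(Tc)))
```

In this toy case KS_2 is dropped for being too small, while KS_3 is kept with eight usable members.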
Value
a clue output object that contains the input parameters used for the evaluation and the evaluation results. Use ls(x) to list its components, where x is the object returned by runClue.
Examples
## Example 1. Running CLUE on a simulated phosphoproteomics dataset
## simulate a time-series phosphoproteomics dataset with 4 clusters,
## each containing 100 phosphosites
simuData <- temporalSimu(seed=1, groupSize=100, sdd=1, numGroups=4)
## create an artificial annotation database. Specifically, generate 50
## kinase-substrate groups, each comprising 20 substrates assigned to a kinase.
## Among them, create 5 groups whose phosphosites are defined
## to have the same temporal profile.
kinaseAnno <- list()
groupSize <- 100
for (i in 1:5) {
  kinaseAnno[[i]] <- paste("p", (groupSize*(i-1)+1):(groupSize*(i-1)+20), sep="_")
}
for (i in 6:50) {
  set.seed(i)
  kinaseAnno[[i]] <- paste("p", sample.int(nrow(simuData), size = 20), sep="_")
}
names(kinaseAnno) <- paste("KS", 1:50, sep="_")
## run CLUE with 3 repeats and k ranging from 2 to 8
set.seed(1)
cl <- runClue(Tc=simuData, annotation=kinaseAnno, rep=3, kRange=2:8,
              standardise = TRUE, universe = NULL)
## visualize the evaluation outcome
boxplot(cl$evlMat, col=rainbow(8), las=2, xlab="# cluster", ylab="Enrichment", main="CLUE")
## generate optimal clustering results using the optimal k determined by CLUE
best <- clustOptimal(cl, rep=3, mfrow=c(2, 3))
## list enriched clusters
best$enrichList
## obtain the optimal clustering object
best$clustObj
## Example 2. Running CLUE on a phosphoproteomics dataset to discover the optimal number of clusters,
## cluster the data accordingly, and identify key kinases involved in each cluster.
## load the human ES phosphoproteomics data (Rigbolt et al. Sci Signal. 4(164):rs3, 2011)
data(hES)
# load the PhosphoSitePlus annotations (Hornbeck et al. Nucleic Acids Res. 40:D261-70, 2012)
# note that one can instead use the PhosphoELM database by typing "data(PhosphoELM)".
data(PhosphoSite)
## run CLUE with 5 repeats and k ranging from 2 to 15
set.seed(1)
cl <- runClue(Tc=hES, annotation=PhosphoSite.human, rep=5, kRange=2:15,
              standardise = TRUE, universe = NULL)
boxplot(cl$evlMat, col=rainbow(15), las=2, xlab="# cluster", ylab="Enrichment", main="CLUE")
best <- clustOptimal(cl, rep=3, mfrow=c(4, 4))
best$enrichList
## Example 3. Running CLUE on a gene expression dataset to discover the optimal number of clusters,
## cluster the data accordingly, and identify key pathways involved in each cluster.
## load the mouse adipocyte gene expression data
# (Ma et al. Molecular and Cellular Biology. 2014, 34(19):3607-17)
data(adipocyte)
## load the KEGG annotations
## note that one can instead use the reactome, GOBP, or biocarta databases
data(Pathways)
## select genes that are differentially expressed during adipocyte differentiation
adipocyte.selected <- adipocyte[adipocyte[,"DE"] == 1,]
## run CLUE with 3 repeats and k ranging from 10 to 20
set.seed(3)
cl <- runClue(Tc=adipocyte.selected, annotation=Pathways.KEGG, rep=3, kRange=10:20,
              standardise = TRUE, universe = NULL)
xl <- "Number of clusters"
yl <- "Enrichment score"
boxplot(cl$evlMat, col=rainbow(ncol(cl$evlMat)), las=2, xlab=xl, ylab=yl, main="CLUE")
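The boxplot calls above suggest that cl$evlMat holds one column per tested k and one row per repeat. Under that assumption, the sketch below shows one simple way to summarise such a matrix by hand, picking the column with the highest mean enrichment. The matrix, its values, and its dimnames are a hypothetical stand-in rather than real runClue output, and clustOptimal may select k differently.

```r
# Hypothetical stand-in for cl$evlMat: 3 repeats (rows) by k = 2..5 (columns).
# matrix() fills column by column, so each line below is one value of k.
evlMat <- matrix(c(0.10, 0.12, 0.11,
                   0.30, 0.28, 0.33,
                   0.45, 0.50, 0.47,
                   0.40, 0.42, 0.38),
                 nrow = 3,
                 dimnames = list(paste0("rep", 1:3), paste0("k", 2:5)))

# Average the enrichment score over repeats for each k.
meanScore <- colMeans(evlMat)

# Pick the column with the highest mean score.
bestCol <- names(which.max(meanScore))
print(meanScore)
print(bestCol)
```

Averaging over repeats smooths out run-to-run variability in the clustering, which is the stated purpose of the rep argument.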