BiBitWorkflow {BiBitR} | R Documentation |
BiBit Workflow
Description
Workflow to discover larger (noisy) patterns in big data using BiBit
Usage
BiBitWorkflow(matrix, minr = 2, minc = 2, similarity_type = "col",
func = "agnes", link = "average", par.method = 0.625,
cut_type = "gap", cut_pm = "Tibs2001SEmax", gap_B = 500,
gap_maxK = 50, noise = 0.1, noise_select = 0, plots = c(3:5),
BCresult = NULL, simmatresult = NULL, treeresult = NULL,
plot.type = "device", filename = "BiBitWorkflow", verbose = TRUE)
Arguments
matrix |
The binary input matrix. |
minr |
The minimum number of rows of the Biclusters. |
minc |
The minimum number of columns of the Biclusters. |
similarity_type |
Which dimension to use for the Jaccard Index in Step 2. This is either columns ( |
func |
Which clustering function to use in Step 3. Either |
link |
Which clustering link to use in Step 3. The available links (depending on
|
par.method |
Additional parameters used for flexible link (See |
cut_type |
Which method should be used to decide the number of clusters in the tree in Step 4?
|
cut_pm |
Cut Parameter (depends on
|
gap_B |
Number of bootstrap samples (default=500) for Gap Statistic ( |
gap_maxK |
Number of clusters to consider (default=50) for Gap Statistic ( |
noise |
The allowed noise level when growing the rows on the merged patterns in Step 6. (default=
|
noise_select |
Should the allowed noise level be automatically selected for each pattern? (This uses an ad hoc method to find the elbow/kink in the Noise Scree plots.)
|
plots |
Vector for which plots to draw:
|
BCresult |
Import a BiBit Biclust result for Step 1 (e.g. extract from an older BiBitWorkflow object |
simmatresult |
Import a (custom) Similarity Matrix (e.g. extract from older BiBitWorkflow object |
treeresult |
Import a (custom) tree ( |
plot.type |
Output Type
|
filename |
Base filename (with/without directory) for the plots if |
verbose |
Logical value if progress of workflow should be printed. |
Details
Looking for noisy biclusters in large data using BiBit (bibit2) often results in many (overlapping) biclusters.
To decrease the number of biclusters and find the larger meaningful patterns which make up the noisy biclusters, the following workflow can be applied.
Note that this workflow is primarily used for data where there are many more rows (e.g. patients) than columns (e.g. symptoms). For example, the workflow would discover larger meaningful symptom patterns which subsets of the patients share, conditioned on the allowed noise/zeros.
1. Apply BiBit with no noise (preferably with high enough minr and minc).
2. Compute Similarity Matrix (Jaccard Index) of all biclusters. By default this measure is only based on column similarity. This implies that the rows of the BC's are not of interest in this step. The goal then would be to discover highly overlapping column patterns and, in the next steps, merge them together.
3. Apply Agglomerative Hierarchical Clustering on Similarity Matrix (default = average link).
4. Cut the dendrogram of the clustering result and merge the biclusters based on this (default = number of clusters is determined by the Tibs2001SEmax Gap Statistic).
5. Extract Column Memberships of the Merged Biclusters. These are saved as the new column Patterns.
6. Starting from these patterns, (noisy) rows are grown which match the pattern, creating a single final bicluster for each pattern. At the end duplicate/non-maximal BC's are deleted.
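Steps 2 to 5 can be illustrated with a minimal base R sketch. It is not the package's implementation: the membership matrix `bc_cols` is made up, `hclust` stands in for the default `agnes`, the tree is cut at a fixed number of clusters instead of using the Gap Statistic, and merging by taking the union of the member columns is an assumption.

```r
# Hypothetical column memberships of five initial biclusters
# (rows = biclusters, columns = data columns); not taken from a BiBit run.
bc_cols <- rbind(
  BC1 = c(1, 1, 1, 1, 0, 0, 0, 0),
  BC2 = c(1, 1, 1, 0, 0, 0, 0, 0),
  BC3 = c(0, 1, 1, 1, 0, 0, 0, 0),
  BC4 = c(0, 0, 0, 0, 1, 1, 1, 0),
  BC5 = c(0, 0, 0, 0, 0, 1, 1, 1)
) == 1

# Step 2: Jaccard index between the column sets of every pair of biclusters.
inter   <- tcrossprod(bc_cols * 1)             # size of each intersection
sizes   <- rowSums(bc_cols)                    # number of columns per bicluster
unions  <- outer(sizes, sizes, "+") - inter    # size of each union
jaccard <- inter / unions

# Step 3: average-link agglomerative clustering on the dissimilarity 1 - Jaccard.
tree <- hclust(as.dist(1 - jaccard), method = "average")

# Step 4: cut the tree at a fixed number of clusters (chosen by hand here).
groups <- cutree(tree, k = 2)

# Step 5: one column pattern per group (assumed here: union of member columns).
patterns <- lapply(
  split(seq_len(nrow(bc_cols)), groups),
  function(idx) which(colSums(bc_cols[idx, , drop = FALSE]) > 0)
)
```

In this toy input the first three biclusters share most of their columns and end up in one group, while the last two form a second group, so `patterns` holds two merged column patterns.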
Using the described workflow (and column similarity in Step 2), the final result will contain biclusters which focus on larger column patterns.
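The row-growing idea of Step 6 can be sketched as follows. This is a simplified illustration under stated assumptions, not the package code: the helper `grow_rows`, the toy matrix and the pattern are hypothetical, and `noise` is assumed to be a value between 0 and 1 giving the maximum fraction of zeros a row may have inside the pattern.

```r
# Hypothetical binary data: 6 rows, 5 columns.
mat <- rbind(
  c(1, 1, 1, 1, 0),
  c(1, 1, 0, 1, 1),
  c(1, 0, 0, 1, 0),
  c(1, 1, 1, 1, 1),
  c(0, 0, 0, 0, 1),
  c(1, 1, 1, 0, 0)
)

pattern <- 1:4    # hypothetical merged column pattern
noise   <- 0.25   # assumed meaning: max fraction of zeros per row in the pattern

# Hypothetical helper: keep every row whose number of zeros inside the
# pattern does not exceed the allowed noise.
grow_rows <- function(mat, pattern, noise) {
  zeros <- rowSums(mat[, pattern, drop = FALSE] == 0)
  which(zeros <= noise * length(pattern))
}

rows_noisy <- grow_rows(mat, pattern, noise)   # rows with at most one zero
rows_exact <- grow_rows(mat, pattern, 0)       # rows matching the pattern perfectly
```

Raising the noise level admits more rows into the bicluster for the same column pattern, which is the trade-off the `noise` and `noise_select` arguments control.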
Value
A BiBitWorkflow S3 List Object with 3 slots:
- Biclust: Biclust Class Object of Final Biclustering Result (after Step 6).
- BiclustSim: Jaccard Index Similarity Matrix of Final Biclustering Result (after Step 6).
- info: List Object containing:
  - BiclustInitial: Biclust Class Object of Initial Biclustering Result (after Step 1).
  - BiclustSimInitial: Jaccard Index Similarity Matrix of Initial Biclustering Result (after Step 1).
  - Tree: Hierarchical Tree of BiclustSimInitial as hclust object.
  - Number: Vector containing the initial number of biclusters (InitialNumber), the number of saved patterns after cutting the tree (PatternNumber) and the final number of biclusters (FinalNumber).
  - GapStat: Vector containing all different optimal cluster numbers based on the Gap Statistic.
  - BC.Merge: A list (length of merged saved patterns) containing which biclusters were merged together after cutting the tree.
  - MergedColPatterns: A list (length of merged saved patterns) containing the indices of which columns make up that pattern.
  - MergedNoiseThresholds: A vector containing the selected noise levels for the merged saved patterns.
  - Coverage: A list containing: 1. a vector of the total number (and percentage) of unique rows the final biclusters cover. 2. a table showing how many rows are used more than a single time in the final biclusters.
  - Call: A match.call of the original function call.
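To make the Coverage entry concrete, the sketch below computes the same kind of summary from a logical rows-by-biclusters membership matrix. The matrix `row_x_bc` and the names of the resulting objects are hypothetical; the exact layout of the Coverage element returned by the function may differ.

```r
# Hypothetical row memberships: 6 data rows, 2 final biclusters.
row_x_bc <- cbind(
  BC1 = c(TRUE,  TRUE, TRUE, FALSE, FALSE, FALSE),
  BC2 = c(FALSE, TRUE, TRUE, TRUE,  FALSE, FALSE)
)

# How many biclusters each row belongs to.
times_used <- rowSums(row_x_bc)

# Number and percentage of unique rows covered by at least one bicluster.
covered  <- sum(times_used > 0)
coverage <- c(RowsCovered = covered,
              Percentage  = 100 * covered / nrow(row_x_bc))

# How often the covered rows are reused across biclusters.
usage_table <- table(times_used[times_used > 0])
```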
Author(s)
Ewoud De Troyer
Examples
## Not run:
## Simulate Data ##
# DATA: 10000x50
# BC1: 200x10
# BC2: 100x10
# BC1 and BC2 overlap 5 columns
# BC3: 200x10
# BC4: 100x10
# BC3 and BC4 overlap 2 columns
# Background 1 percentage: 0.15
# BC Signal Percentage: 0.9
set.seed(273)
mat <- matrix(sample(c(0,1),10000*50,replace=TRUE,prob=c(1-0.15,0.15)),
              nrow=10000,ncol=50)
mat[1:200,1:10] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                          nrow=200,ncol=10)
mat[300:399,6:15] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                            nrow=100,ncol=10)
mat[400:599,21:30] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=200,ncol=10)
mat[700:799,29:38] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=100,ncol=10)
mat <- mat[sample(1:10000,10000,replace=FALSE),sample(1:50,50,replace=FALSE)]
# Computing gap statistic for initial 1381 BC takes approx. 15 min.
# Gap Statistic chooses 4 clusters.
out <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2)
summary(out$Biclust)
# Reduce computation by selecting number of clusters manually.
# Note: The "ClusterRowCoverage" function can be used to provide extra info
# on the choice of the number of clusters.
# How?
# - More clusters result in smaller column patterns and more matching rows.
# - Fewer clusters result in larger column patterns and fewer matching rows.
# Step 1: Initial Workflow Run
out2 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=10)
# Step 2: Use ClusterRowCoverage
temp <- ClusterRowCoverage(result=out2,matrix=mat,noise=0.2,plots=2)
# Step 3: Use BiBitWorkflow again (using previously computed parts) with new cut parameter
out3 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=4,
                      BCresult = out2$info$BiclustInitial,
                      simmatresult = out2$info$BiclustSimInitial)
summary(out3$Biclust)
## End(Not run)