BiBitWorkflow {BiBitR}  R Documentation 
Workflow to discover larger (noisy) patterns in big data using BiBit
BiBitWorkflow(matrix, minr = 2, minc = 2, similarity_type = "col", func = "agnes", link = "average", par.method = 0.625, cut_type = "gap", cut_pm = "Tibs2001SEmax", gap_B = 500, gap_maxK = 50, noise = 0.1, noise_select = 0, plots = c(3:5), BCresult = NULL, simmatresult = NULL, treeresult = NULL, plot.type = "device", filename = "BiBitWorkflow", verbose = TRUE)
matrix 
The binary input matrix. 
minr 
The minimum number of rows of the Biclusters. 
minc 
The minimum number of columns of the Biclusters. 
similarity_type 
Which dimension to use for the Jaccard Index in Step 2. This is either columns ( 
func 
Which clustering function to use in Step 3. Either 
link 
Which clustering link to use in Step 3. The available links (depending on

par.method 
Additional parameters used for flexible link (See 
cut_type 
Which method should be used to decide the number of clusters in the tree in Step 4?

cut_pm 
Cut Parameter (depends on

gap_B 
Number of bootstrap samples (default=500) for Gap Statistic ( 
gap_maxK 
Number of clusters to consider (default=50) for Gap Statistic ( 
noise 
The allowed noise level when growing the rows on the merged patterns in Step 6. (default=

noise_select 
Should the allowed noise level be automatically selected for each pattern? (Uses an ad hoc method to find the elbow/kink in the Noise Scree plots.)

plots 
Vector for which plots to draw:

BCresult 
Import a BiBit Biclust result for Step 1 (e.g. extract from an older BiBitWorkflow object 
simmatresult 
Import a (custom) Similarity Matrix (e.g. extract from older BiBitWorkflow object 
treeresult 
Import a (custom) tree ( 
plot.type 
Output Type

filename 
Base filename (with/without directory) for the plots if 
verbose 
Logical value indicating whether the progress of the workflow should be printed. 
Looking for noisy biclusters in large data using BiBit (bibit2) often results in many (overlapping) biclusters.
In order to decrease the number of biclusters and find larger meaningful patterns which make up noisy biclusters, the following workflow can be applied.
Note that this workflow is primarily intended for data with many more rows (e.g. patients) than columns (e.g. symptoms). For example, the workflow would discover larger meaningful symptom patterns which subsets of the patients share, conditional on the allowed noise (zeros).
1. Apply BiBit with no noise (preferably with high enough minr and minc).
2. Compute Similarity Matrix (Jaccard Index) of all biclusters. By default this measure is only based on column similarity. This implies that the rows of the BC's are not of interest in this step. The goal then would be to discover highly overlapping column patterns and, in the next steps, merge them together.
3. Apply Agglomerative Hierarchical Clustering on Similarity Matrix (default = average link).
4. Cut the dendrogram of the clustering result and merge the biclusters based on this (default = number of clusters is determined by the Tibs2001SEmax Gap Statistic).
5. Extract Column Memberships of the Merged Biclusters. These are saved as the new column Patterns.
6. Starting from these patterns, (noisy) rows are grown which match the pattern, creating a single final bicluster for each pattern. At the end duplicate/nonmaximal BC's are deleted.
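To make Steps 2 to 5 concrete, the following is a minimal base R sketch on toy data. It is not the package's internal code. The list bc_cols is a hypothetical stand-in for the column memberships of the Step 1 biclusters, stats::hclust is swapped in for the agnes default, the tree is cut at a fixed number of clusters rather than with the Gap Statistic, and taking the union of columns when merging is an assumption.

```r
# Toy stand-in for the Step 1 output: each bicluster is represented only by
# its column indices (hypothetical data, not produced by BiBit).
bc_cols <- list(
  c(1, 2, 3, 4),
  c(2, 3, 4, 5),
  c(1, 2, 3, 5),
  c(10, 11, 12),
  c(10, 11, 12, 13)
)

# Step 2: Jaccard index between two column sets.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

n <- length(bc_cols)
simmat <- matrix(1, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    simmat[i, j] <- jaccard(bc_cols[[i]], bc_cols[[j]])
  }
}

# Step 3: agglomerative clustering with average link on the dissimilarity
# 1 - similarity. stats::hclust is used here in place of agnes.
tree <- hclust(as.dist(1 - simmat), method = "average")

# Step 4: cut the tree into a fixed number of clusters
# (comparable to cut_type = "number"; the default uses the Gap Statistic).
membership <- cutree(tree, k = 2)

# Step 5: merge the column memberships of the biclusters in each cluster.
# Assumption: a merged pattern is the union of the member columns.
merged_patterns <- lapply(split(bc_cols, membership),
                          function(x) sort(unique(unlist(x))))
print(merged_patterns)
```

In this toy case the first three biclusters overlap heavily and are merged into one pattern covering columns 1 to 5, while the last two form a second pattern covering columns 10 to 13.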
Using the described workflow (and column similarity in Step 2), the final result will contain biclusters which focus on larger column patterns.
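The row-growing idea of Step 6 can be illustrated with a short base R sketch. The helper grow_rows is hypothetical and not part of BiBitR. It assumes that the noise level is the maximum fraction of zeros a row may have within the pattern columns; the package may interpret other noise values differently.

```r
# Hypothetical helper (not part of BiBitR): return the indices of the rows
# whose fraction of zeros within the pattern columns does not exceed the
# allowed noise level.
grow_rows <- function(mat, pattern_cols, noise = 0.1) {
  zeros_per_row <- rowSums(mat[, pattern_cols, drop = FALSE] == 0)
  which(zeros_per_row / length(pattern_cols) <= noise)
}

# Toy binary matrix; the pattern consists of columns 1 to 4.
toy <- rbind(
  c(1, 1, 1, 1, 0),  # no zeros inside the pattern
  c(1, 0, 1, 1, 1),  # one zero inside the pattern (25%)
  c(0, 0, 1, 1, 1),  # two zeros inside the pattern (50%)
  c(1, 1, 1, 1, 1)   # no zeros inside the pattern
)
pattern <- 1:4

grow_rows(toy, pattern, noise = 0)     # rows 1 and 4
grow_rows(toy, pattern, noise = 0.25)  # rows 1, 2 and 4
```

Raising the noise level admits more rows for the same column pattern, which is the trade-off the noise argument controls.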
A BiBitWorkflow S3 List Object with 3 slots:
Biclust: Biclust Class Object of Final Biclustering Result (after Step 6).
BiclustSim: Jaccard Index Similarity Matrix of Final Biclustering Result (after Step 6).
info: List Object containing:
  BiclustInitial: Biclust Class Object of Initial Biclustering Result (after Step 1).
  BiclustSimInitial: Jaccard Index Similarity Matrix of Initial Biclustering Result (after Step 1).
  Tree: Hierarchical Tree of BiclustSimInitial as hclust object.
  Number: Vector containing the initial number of biclusters (InitialNumber), the number of saved patterns after cutting the tree (PatternNumber) and the final number of biclusters (FinalNumber).
  GapStat: Vector containing all different optimal cluster numbers based on the Gap Statistic.
  BC.Merge: A list (length of merged saved patterns) containing which biclusters were merged together after cutting the tree.
  MergedColPatterns: A list (length of merged saved patterns) containing the indices of which columns make up that pattern.
  MergedNoiseThresholds: A vector containing the selected noise levels for the merged saved patterns.
  Coverage: A list containing: 1. a vector of the total number (and percentage) of unique rows the final biclusters cover. 2. a table showing how many rows are used more than a single time in the final biclusters.
  Call: A match.call of the original function call.
Ewoud De Troyer
## Not run: 
## Simulate Data ##
# DATA: 10000x50
# BC1: 200x10
# BC2: 100x10
# BC1 and BC2 overlap 5 columns
# BC3: 200x10
# BC4: 100x10
# BC3 and BC4 overlap 2 columns
# Background 1 percentage: 0.15
# BC Signal Percentage: 0.9

set.seed(273)
mat <- matrix(sample(c(0,1),10000*50,replace=TRUE,prob=c(1-0.15,0.15)),
              nrow=10000,ncol=50)
mat[1:200,1:10] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                          nrow=200,ncol=10)
mat[300:399,6:15] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                            nrow=100,ncol=10)
mat[400:599,21:30] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=200,ncol=10)
mat[700:799,29:38] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=100,ncol=10)
# Shuffle rows and columns so the planted biclusters are not contiguous
mat <- mat[sample(1:10000,10000,replace=FALSE),sample(1:50,50,replace=FALSE)]

# Computing gap statistic for initial 1381 BC takes approx. 15 min.
# Gap Statistic chooses 4 clusters.
out <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2)
summary(out$Biclust)

# Reduce computation by selecting number of clusters manually.
# Note: The "ClusterRowCoverage" function can be used to provide extra info
#       on the number of cluster choice.
#       How?
#       - More clusters result in smaller column patterns and more matching rows.
#       - Fewer clusters result in larger column patterns and fewer matching rows.

# Step 1: Initial Workflow Run
out2 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=10)
# Step 2: Use ClusterRowCoverage
temp <- ClusterRowCoverage(result=out2,matrix=mat,noise=0.2,plots=2)
# Step 3: Use BiBitWorkflow again (using previously computed parts) with new cut parameter
out3 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=4,
                      BCresult = out2$info$BiclustInitial,
                      simmatresult = out2$info$BiclustSimInitial)
summary(out3$Biclust)

## End(Not run)