BiBitWorkflow {BiBitR}    R Documentation

BiBit Workflow

Description

Workflow to discover larger (noisy) patterns in big data using BiBit

Usage

BiBitWorkflow(matrix, minr = 2, minc = 2, similarity_type = "col",
  func = "agnes", link = "average", par.method = 0.625,
  cut_type = "gap", cut_pm = "Tibs2001SEmax", gap_B = 500,
  gap_maxK = 50, noise = 0.1, noise_select = 0, plots = c(3:5),
  BCresult = NULL, simmatresult = NULL, treeresult = NULL,
  plot.type = "device", filename = "BiBitWorkflow", verbose = TRUE)

Arguments

matrix

The binary input matrix.

minr

The minimum number of rows of the Biclusters.

minc

The minimum number of columns of the Biclusters.

similarity_type

Which dimension to use for the Jaccard Index in Step 2. This is either columns ("col", default) or both ("both").

func

Which clustering function to use in Step 3. Either "agnes" (= default) or "hclust".

link

Which clustering link to use in Step 3. The available links (depending on func) are:

  • hclust: "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid"

  • agnes: "average" (default), "single", "complete", "ward", "weighted", "gaverage" or "flexible"

(More details in hclust and agnes)

par.method

Additional parameters used for flexible link (See agnes). Default is c(0.625)

cut_type

Which method should be used to decide the number of clusters in the tree in Step 4?

  • "gap": Use the Gap Statistic (default).

  • "number": Select a set number of clusters.

  • "height": Cut the tree at specific dissimilarity height.

cut_pm

Cut Parameter (depends on cut_type) for Step 4

  • Gap Statistic (cut_type="gap"): How to compute optimal number of clusters? Choose one of the following: "Tibs2001SEmax" (default), "globalmax", "firstmax", "firstSEmax" or "globalSEmax".

  • Number (cut_type="number"): Integer for number of clusters.

  • Height (cut_type="height"): Numeric dissimilarity value at which the tree should be cut (a value in [0,1]).
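
As a rough illustration of the "number" and "height" options, the sketch below cuts a small average-link tree with base R's hclust and cutree. The toy similarity matrix sim and the cut values are made up for illustration, and this is not taken from the package's internal code.

```r
# Toy similarity matrix for 4 biclusters (hypothetical values in [0,1])
sim <- matrix(c(1.0, 0.8, 0.1, 0.0,
                0.8, 1.0, 0.2, 0.1,
                0.1, 0.2, 1.0, 0.7,
                0.0, 0.1, 0.7, 1.0), nrow = 4, byrow = TRUE)

# Cluster on dissimilarity = 1 - similarity, using average link
tree <- hclust(as.dist(1 - sim), method = "average")

# Analogous to cut_type = "number": ask for a fixed number of clusters
by_number <- cutree(tree, k = 2)

# Analogous to cut_type = "height": cut at a dissimilarity height in [0,1]
by_height <- cutree(tree, h = 0.5)

print(by_number)
print(by_height)
```

In this toy case both cuts give the same two groups, because the two tight pairs merge below height 0.5 and the final merge happens well above it.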

gap_B

Number of bootstrap samples (default=500) for Gap Statistic (clusGap).

gap_maxK

Number of clusters to consider (default=50) for Gap Statistic (clusGap).

noise

The allowed noise level when growing the rows on the merged patterns in Step 6. (default=0.1, namely allow 10% noise.)

  • noise=0: No noise allowed.

  • 0<noise<1: The noise parameter will be a noise percentage. The number of allowed 0's in a row in the bicluster will depend on the column size of the bicluster. More specifically zeros_allowed = ceiling(noise * columnsize). For example for noise=0.10 and a bicluster column size of 5, the number of allowed 0's would be 1.

  • noise>=1: The noise parameter will be the number of allowed 0's in a row in the bicluster independent from the column size of the bicluster. In this noise option, the noise parameter should be an integer.
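
The three noise cases can be sketched in a few lines of base R. The helper names zeros_allowed() and matching_rows() and the toy matrix are hypothetical and only illustrate the rule stated above; the package's own row-growing code may differ.

```r
# Number of 0's allowed per row, following the three noise cases above
zeros_allowed <- function(noise, columnsize) {
  if (noise >= 1) {
    as.integer(noise)              # absolute count, independent of column size
  } else {
    ceiling(noise * columnsize)    # share of the column size (0 if noise = 0)
  }
}

# Hypothetical helper: rows of a binary matrix that match a column pattern
# when at most zeros_allowed() 0's are tolerated per row
matching_rows <- function(mat, pattern_cols, noise) {
  allowed <- zeros_allowed(noise, length(pattern_cols))
  zeros <- rowSums(mat[, pattern_cols, drop = FALSE] == 0)
  which(zeros <= allowed)
}

toy <- rbind(c(1, 1, 1, 1, 1, 0),
             c(1, 0, 1, 1, 1, 0),
             c(0, 0, 1, 1, 1, 1))

zeros_allowed(0.10, 5)          # 1, as in the example above
matching_rows(toy, 1:5, 0.10)   # rows 1 and 2
matching_rows(toy, 1:5, 0)      # row 1 only
```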

noise_select

Should the allowed noise level be automatically selected for each pattern? (Using ad hoc method to find the elbow/kink in the Noise Scree plots)

  • noise_select=0: Do NOT automatically select the noise levels. Use the noise level given in the noise parameter (default).

  • noise_select=1: Using the Noise Scree plot (with 'Added Rows' on the y-axis), find the noise level at which the number of added rows is larger than the mean of 'added rows' at the lower noise levels. After locating this noise level, lower it by 1. This is the automatically selected elbow/kink and therefore the noise level that is used.

  • noise_select=2: Applies the same steps as for noise_select=1, but instead of decreasing the noise level by only 1, keep decreasing the noise level until the number of added rows isn't decreasing anymore either.
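
One possible reading of the noise_select=1 rule is sketched below. The function select_noise_level(), the choice of the first qualifying level, the fallback when no jump is found, and the scree values are all assumptions made for illustration; the package's ad hoc method may handle these details differently.

```r
# added[i] = number of rows added at noise level i - 1 (levels 0, 1, 2, ...)
select_noise_level <- function(added) {
  for (i in seq_along(added)[-1]) {
    # Compare with the mean of 'added rows' over all lower noise levels
    if (added[i] > mean(added[seq_len(i - 1)])) {
      return(i - 2)  # index i is noise level i - 1; step back by 1 level
    }
  }
  length(added) - 1  # assumption: no jump found, keep the highest level
}

added <- c(50, 20, 10, 5, 40, 60)  # made-up scree values for levels 0..5
select_noise_level(added)
```

With these made-up values the jump occurs at noise level 4 (40 added rows against a mean of 21.25 at the lower levels), so the sketch selects level 3.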

plots

Vector for which plots to draw:

  1. Image plot of the similarity matrix computed in Step 2.

  2. Same as plots=1, but the rows and columns are reordered with the hierarchical tree.

  3. Dendrogram of the tree, its clusters colored after the chosen cut has been applied.

  4. Noise Scree plots for all the Saved Patterns. Two plots will be drawn, both with Noise on the x-axis. The first one has the Number of Added Rows at that noise level on the y-axis, while the second has the Total Number of Rows (i.e. the cumulative of the first). If the title of one of the subplots is red, the Bicluster grown from this pattern, using the chosen noise level, was eventually deleted because it was a duplicate or non-maximal.

  5. Image plot of the Jaccard Index similarity matrix between the final biclusters after Step 6.

BCresult

Import a BiBit Biclust result for Step 1 (e.g. extract from an older BiBitWorkflow object $info$BiclustInitial). This can be useful if you want to cut the tree differently/make different plots, but don't want to do the BiBit calculation again.

simmatresult

Import a (custom) Similarity Matrix (e.g. extract from older BiBitWorkflow object $info$BiclustSimInitial). Note that Step 1 (BiBit) will still be executed if BCresult is not provided.

treeresult

Import a (custom) tree (hclust object) based on the BiBit/Similarity (e.g. extract from older BiBitWorkflow object $info$Tree).

plot.type

Output Type

  • "device": All plots are outputted to new R graphics devices (default).

  • "file": All plots are saved in external files. Plots 1 and 2 are saved in separate .png files while all other plots are joined together in a single .pdf file.

  • "other": All plots are outputted to the current graphics device, but will overwrite each other. Use this if you want to include one or more plots in a Sweave/knitr file or if you want to export a single plot in a format of your own choosing.

filename

Base filename (with/without directory) for the plots if plot.type="file" (default="BiBitWorkflow").

verbose

Logical value indicating whether the progress of the workflow should be printed.

Details

Looking for noisy biclusters in large data using BiBit (bibit2) often results in many (overlapping) biclusters. In order to decrease the number of biclusters and find larger meaningful patterns which make up noisy biclusters, the following workflow can be applied. Note that this workflow is primarily used for data where there are many more rows (e.g. patients) than columns (e.g. symptoms). For example, the workflow would discover larger meaningful symptom patterns which, conditioned on the allowed noise/zeros, are shared by subsets of the patients.

  1. Apply BiBit with no noise (Preferably with high enough minr and minc).

  2. Compute Similarity Matrix (Jaccard Index) of all biclusters. By default this measure is only based on column similarity. This implies that the rows of the BC's are not of interest in this step. The goal then would be to discover highly overlapping column patterns and, in the next steps, merge them together.

  3. Apply Agglomerative Hierarchical Clustering on Similarity Matrix (default = average link)

  4. Cut the dendrogram of the clustering result and merge the biclusters based on this. (default = number of clusters is determined by the Tibs2001SEmax Gap Statistic)

  5. Extract Column Memberships of the Merged Biclusters. These are saved as the new column Patterns.

  6. Starting from these patterns, (noisy) rows are grown which match the pattern, creating a single final bicluster for each pattern. At the end duplicate/non-maximal BC's are deleted.

Using the described workflow (and column similarity in Step 2), the final result will contain biclusters which focus on larger column patterns.
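
Steps 2 to 5 can be illustrated on a toy example with base R only. The column memberships in bc_cols are made up, and taking the union of columns as the merged pattern in Step 5 is an assumption of this sketch, not a statement about the package's implementation.

```r
# Toy column memberships: one row per initial bicluster, one column per
# data column (hypothetical values; TRUE = bicluster contains that column)
bc_cols <- rbind(
  BC1 = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE),
  BC2 = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE, FALSE),
  BC3 = c(FALSE, FALSE, FALSE, FALSE, TRUE,  TRUE),
  BC4 = c(FALSE, FALSE, FALSE, TRUE,  TRUE,  TRUE)
)

# Step 2: Jaccard Index on columns = |intersection| / |union|
inter <- tcrossprod(bc_cols * 1)             # pairwise shared column counts
sizes <- rowSums(bc_cols)
jaccard <- inter / (outer(sizes, sizes, "+") - inter)

# Step 3: agglomerative clustering on dissimilarity (average link)
tree <- hclust(as.dist(1 - jaccard), method = "average")

# Step 4: cut the tree (here a fixed number of clusters)
groups <- cutree(tree, k = 2)

# Step 5: merged column pattern per cluster, taken here as the union
# of the columns of its biclusters (assumption of this sketch)
patterns <- lapply(split(seq_len(nrow(bc_cols)), groups), function(idx) {
  which(colSums(bc_cols[idx, , drop = FALSE]) > 0)
})
print(patterns)
```

Here BC1 and BC2 share 3 of 4 columns (Jaccard Index 0.75) and end up in one cluster with pattern columns 1 to 4, while BC3 and BC4 form the second cluster with pattern columns 4 to 6.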

Value

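A BiBitWorkflow S3 List Object with 3 slots. Those used elsewhere on this page are Biclust (the biclustering result passed to summary in the Examples) and info (which contains BiclustInitial, BiclustSimInitial and Tree).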

Author(s)

Ewoud De Troyer

Examples

## Not run: 
## Simulate Data ##
# DATA: 10000x50
# BC1: 200x10
# BC2: 100x10
# BC1 and BC2 overlap 5 columns

# BC3: 200x10
# BC4: 100x10
# BC3 and BC4 overlap 2 columns

# Background 1 percentage: 0.15
# BC Signal Percentage: 0.9
 
set.seed(273)
mat <- matrix(sample(c(0,1),10000*50,replace=TRUE,prob=c(1-0.15,0.15)),
              nrow=10000,ncol=50)
mat[1:200,1:10] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                          nrow=200,ncol=10)
mat[300:399,6:15] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                            nrow=100,ncol=10)
mat[400:599,21:30] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=200,ncol=10)
mat[700:799,29:38] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=100,ncol=10)
mat <- mat[sample(1:10000,10000,replace=FALSE),sample(1:50,50,replace=FALSE)]


# Computing gap statistic for initial 1381 BC takes approx. 15 min.
# Gap Statistic chooses 4 clusters. 
out <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2) 
summary(out$Biclust)

# Reduce computation by selecting number of clusters manually.
# Note: The "ClusterRowCoverage" function can be used to provide extra info
#       on the choice of the number of clusters.
#       How?
#       - More clusters result in smaller column patterns and more matching rows.
#       - Fewer clusters result in larger column patterns and fewer matching rows.
# Step 1: Initial Workflow Run
out2 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=10)
# Step 2: Use ClusterRowCoverage
temp <- ClusterRowCoverage(result=out2,matrix=mat,noise=0.2,plots=2)
# Step 3: Use BiBitWorkflow again (using previously computed parts) with new cut parameter
out3 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=4,
                      BCresult = out2$info$BiclustInitial,
                      simmatresult = out2$info$BiclustSimInitial)
summary(out3$Biclust)

## End(Not run)

[Package BiBitR version 0.3.1 Index]