R: BiBit Workflow

BiBitWorkflow {BiBitR}

R Documentation

BiBit Workflow

Description

Workflow to discover larger (noisy) patterns in big data using BiBit

Usage

BiBitWorkflow(matrix, minr = 2, minc = 2, similarity_type = "col",
  func = "agnes", link = "average", par.method = 0.625,
  cut_type = "gap", cut_pm = "Tibs2001SEmax", gap_B = 500,
  gap_maxK = 50, noise = 0.1, noise_select = 0, plots = c(3:5),
  BCresult = NULL, simmatresult = NULL, treeresult = NULL,
  plot.type = "device", filename = "BiBitWorkflow", verbose = TRUE)

Arguments

`matrix`	The binary input matrix.
`minr`	The minimum number of rows of the Biclusters.
`minc`	The minimum number of columns of the Biclusters.
`similarity_type`	Which dimension to use for the Jaccard Index in Step 2. This is either columns (`"col"`, default) or both (`"both"`).
`func`	Which clustering function to use in Step 3. Either `"agnes"` (= default) or `"hclust"`.
`link`	Which clustering link to use in Step 3. The available links depend on `func` (more details in `hclust` and `agnes`):
- `hclust`: `"ward.D"`, `"ward.D2"`, `"single"`, `"complete"`, `"average"`, `"mcquitty"`, `"median"` or `"centroid"`
- `agnes`: `"average"` (default), `"single"`, `"complete"`, `"ward"`, `"weighted"`, `"gaverage"` or `"flexible"`
`par.method`	Additional parameters used for flexible link (See `agnes`). Default is `c(0.625)`
`cut_type`	Which method should be used to decide the number of clusters in the tree in Step 4?
- `"gap"`: Use the Gap Statistic (default).
- `"number"`: Select a set number of clusters.
- `"height"`: Cut the tree at a specific dissimilarity height.
`cut_pm`	Cut parameter for Step 4 (depends on `cut_type`):
- Gap Statistic (`cut_type="gap"`): How to compute the optimal number of clusters. Choose one of the following: `"Tibs2001SEmax"` (default), `"globalmax"`, `"firstmax"`, `"firstSEmax"` or `"globalSEmax"`.
- Number (`cut_type="number"`): Integer for the number of clusters.
- Height (`cut_type="height"`): Numeric dissimilarity value in `[0,1]` where the tree should be cut.
`gap_B`	Number of bootstrap samples (default=500) for Gap Statistic (`clusGap`).
`gap_maxK`	Number of clusters to consider (default=50) for Gap Statistic (`clusGap`).
`noise`	The allowed noise level when growing the rows on the merged patterns in Step 6 (default=`0.1`, namely allow 10% noise).
- `noise=0`: No noise allowed.
- `0<noise<1`: The `noise` parameter will be a noise percentage. The number of allowed 0's in a row in the bicluster will depend on the column size of the bicluster. More specifically `zeros_allowed = ceiling(noise * columnsize)`. For example, for `noise=0.10` and a bicluster column size of `5`, the number of allowed 0's would be `1`.
- `noise>=1`: The `noise` parameter will be the number of allowed 0's in a row in the bicluster, independent of the column size of the bicluster. In this noise option, the noise parameter should be an integer.
`noise_select`	Should the allowed noise level be automatically selected for each pattern? (Uses an ad hoc method to find the elbow/kink in the Noise Scree plots.)
- `noise_select=0`: Do NOT automatically select the noise levels. Use the noise level given in the `noise` parameter (default).
- `noise_select=1`: Using the Noise Scree plot (with 'Added Rows' on the y-axis), find the noise level where the current number of added rows at this noise level is larger than the mean of 'added rows' at the lower noise levels. After locating this noise level, lower the noise level by 1. This is your automatically selected elbow/kink and therefore your noise level.
- `noise_select=2`: Applies the same steps as for `noise_select=1`, but instead of decreasing the noise level by only 1, keep decreasing the noise level until the number of added rows is no longer decreasing either.
`plots`	Vector for which plots to draw:
1. Image plot of the similarity matrix computed in Step 2.
2. Same as `plots=1`, but the rows and columns are reordered with the hierarchical tree.
3. Dendrogram of the tree, its clusters colored after the chosen cut has been applied.
4. Noise Scree plots for all the Saved Patterns. Two plots will be plotted, both with Noise on the x-axis. The first one will have the Number of Added Rows at that noise level on the y-axis, while the second will have the Total Number of Rows (i.e. cumulative of the first). If the title of one of the subplots is red, then this means that the Bicluster grown from this pattern, using the chosen noise level, was eventually deleted due to being a duplicate or non-maximal.
5. Image plot of the Jaccard Index similarity matrix between the final biclusters after Step 6.
`BCresult`	Import a BiBit Biclust result for Step 1 (e.g. extract from an older BiBitWorkflow object `$info$BiclustInitial`). This can be useful if you want to cut the tree differently/make different plots, but don't want to do the BiBit calculation again.
`simmatresult`	Import a (custom) Similarity Matrix (e.g. extract from older BiBitWorkflow object `$info$BiclustSimInitial`). Note that Step 1 (BiBit) will still be executed if `BCresult` is not provided.
`treeresult`	Import a (custom) tree (`hclust` object) based on the BiBit/Similarity (e.g. extract from older BiBitWorkflow object `$info$Tree`).
`plot.type`	Output type:
- `"device"`: All plots are outputted to new R graphics devices (default).
- `"file"`: All plots are saved in external files. Plots 1 and 2 are saved in separate `.png` files while all other plots are joined together in a single `.pdf` file.
- `"other"`: All plots are outputted to the current graphics device, but will overwrite each other. Use this if you want to include one or more plots in a sweave/knitr file or if you want to export a single plot in a format of your own choosing.
`filename`	Base filename (with/without directory) for the plots if `plot.type="file"` (default=`"BiBitWorkflow"`).
`verbose`	Logical value indicating whether the progress of the workflow should be printed.
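
The three regimes of the `noise` argument can be illustrated with a small sketch. The helper `zeros_allowed()` below is hypothetical and not part of BiBitR; it only mirrors the rule `zeros_allowed = ceiling(noise * columnsize)` stated for `noise`, and the package's internal handling may differ.

```r
# Hypothetical helper (NOT a BiBitR function): number of zeros allowed per row
# for a bicluster with `columnsize` columns, following the `noise` rules above.
zeros_allowed <- function(noise, columnsize) {
  stopifnot(noise >= 0, columnsize >= 1)
  if (noise < 1) {
    # noise = 0 gives 0; 0 < noise < 1 is a fraction of the column size
    ceiling(noise * columnsize)
  } else {
    # noise >= 1 is an absolute count, independent of the column size
    as.integer(noise)
  }
}

zeros_allowed(0.10, 5)  # 1, as in the example given for `noise`
zeros_allowed(0, 5)     # 0, no noise allowed
zeros_allowed(2, 5)     # 2, absolute number of zeros
```

Because of the `ceiling()`, any non-zero fractional noise level allows at least one zero per row, even for small column patterns.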

Details

Looking for Noisy Biclusters in large data using BiBit (bibit2) often results in many (overlapping) biclusters. In order to decrease the number of biclusters and find larger meaningful patterns which make up noisy biclusters, the following workflow can be applied. Note that this workflow is primarily used for data where there are many more rows (e.g. patients) than columns (e.g. symptoms). For example, the workflow would discover larger meaningful symptom patterns which, conditioned on the allowed noise/zeros, subsets of the patients share.

1. Apply BiBit with no noise (preferably with high enough `minr` and `minc`).
2. Compute the Similarity Matrix (Jaccard Index) of all biclusters. By default this measure is only based on column similarity. This implies that the rows of the BC's are not of interest in this step. The goal then would be to discover highly overlapping column patterns and, in the next steps, merge them together.
3. Apply Agglomerative Hierarchical Clustering on the Similarity Matrix (default = average link).
4. Cut the dendrogram of the clustering result and merge the biclusters based on this (default = number of clusters is determined by the Tibs2001SEmax Gap Statistic).
5. Extract Column Memberships of the Merged Biclusters. These are saved as the new column Patterns.
6. Starting from these patterns, (noisy) rows are grown which match the pattern, creating a single final bicluster for each pattern. At the end, duplicate/non-maximal BC's are deleted.
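
Steps 2 to 5 can be sketched in base R on a toy example. This is an illustration under stated assumptions, not the package's implementation: the membership matrix `col_member` is made-up data, the tree is cut at a fixed number of clusters rather than with the Gap Statistic, and merging is assumed to mean taking the union of the columns of all biclusters in a cluster.

```r
# Toy column memberships of four initial biclusters (hypothetical data):
# each row is a bicluster, each column a column of the data matrix.
col_member <- rbind(
  BC1 = c(1, 1, 1, 0, 0, 0),
  BC2 = c(1, 1, 0, 0, 0, 0),
  BC3 = c(0, 0, 0, 1, 1, 1),
  BC4 = c(0, 0, 0, 0, 1, 1)
)

# Step 2: Jaccard index on columns = |intersection| / |union|
inter <- col_member %*% t(col_member)
sizes <- rowSums(col_member)
uni <- outer(sizes, sizes, "+") - inter
jaccard <- inter / uni

# Step 3: agglomerative clustering (average link) on the dissimilarity 1 - J
tree <- hclust(as.dist(1 - jaccard), method = "average")

# Step 4: cut the tree into a fixed number of clusters (here 2)
groups <- cutree(tree, k = 2)

# Step 5 (assumed to be a union): columns used by any bicluster in a group
patterns <- lapply(split(seq_len(nrow(col_member)), groups), function(idx) {
  which(colSums(col_member[idx, , drop = FALSE]) > 0)
})
patterns
```

In this toy case BC1 and BC2 end up in one group and BC3 and BC4 in the other, giving the merged column patterns 1 to 3 and 4 to 6.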

Using the described workflow (and column similarity in Step 2), the final result will contain biclusters which focus on larger column patterns.
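
The row growing of Step 6 can be pictured as follows. This is a minimal sketch on made-up data, assuming a fractional noise level and the `ceiling(noise * columnsize)` rule from the `noise` argument; the names `toy`, `pattern` and `max_zeros` are illustrative only, and the removal of duplicate/non-maximal biclusters is not shown.

```r
# Hypothetical toy data: 5 rows, 6 columns
toy <- rbind(
  c(1, 1, 1, 1, 0, 0),
  c(1, 1, 1, 0, 0, 1),
  c(1, 0, 0, 1, 0, 0),
  c(1, 1, 1, 1, 1, 1),
  c(0, 1, 1, 1, 0, 0)
)

pattern <- 1:4                                  # merged column pattern
noise <- 0.25                                   # fractional noise level
max_zeros <- ceiling(noise * length(pattern))   # 1 zero allowed per row

# Count the zeros of every row inside the pattern columns and keep the rows
# that stay within the allowed number of zeros.
zeros_per_row <- rowSums(toy[, pattern, drop = FALSE] == 0)
rows <- which(zeros_per_row <= max_zeros)
rows
```

Rows 1, 2, 4 and 5 are kept here; row 3 has two zeros inside the pattern and is dropped.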

Value

A BiBitWorkflow S3 List Object with 3 slots:

Biclust: Biclust Class Object of Final Biclustering Result (after Step 6).
BiclustSim: Jaccard Index Similarity Matrix of Final Biclustering Result (after Step 6).
info: List Object containing:
- BiclustInitial: Biclust Class Object of Initial Biclustering Result (after Step 1).
- BiclustSimInitial: Jaccard Index Similarity Matrix of Initial Biclustering Result (after Step 1).
- Tree: Hierarchical Tree of BiclustSimInitial as hclust object.
- Number: Vector containing the initial number of biclusters (InitialNumber), the number of saved patterns after cutting the tree (PatternNumber) and the final number of biclusters (FinalNumber).
- GapStat: Vector containing all different optimal cluster numbers based on the Gap Statistic.
- BC.Merge: A list (length of merged saved patterns) containing which biclusters were merged together after cutting the tree.
- MergedColPatterns: A list (length of merged saved patterns) containing the indices of which columns make up that pattern.
- MergedNoiseThresholds: A vector containing the selected noise levels for the merged saved patterns.
- Coverage: A list containing: 1. a vector of the total number (and percentage) of unique rows the final biclusters cover. 2. a table showing how many rows are used more than a single time in the final biclusters.
- Call: A match.call of the original function call.

Author(s)

Ewoud De Troyer

Examples

## Not run: 
## Simulate Data ##
# DATA: 10000x50
# BC1: 200x10
# BC2: 100x10
# BC1 and BC2 overlap 5 columns

# BC3: 200x10
# BC4: 100x10
# BC3 and BC4 overlap 2 columns

# Background 1 percentage: 0.15
# BC Signal Percentage: 0.9
 
set.seed(273)
mat <- matrix(sample(c(0,1),10000*50,replace=TRUE,prob=c(1-0.15,0.15)),
              nrow=10000,ncol=50)
mat[1:200,1:10] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                          nrow=200,ncol=10)
mat[300:399,6:15] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                            nrow=100,ncol=10)
mat[400:599,21:30] <- matrix(sample(c(0,1),200*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=200,ncol=10)
mat[700:799,29:38] <- matrix(sample(c(0,1),100*10,replace=TRUE,prob=c(1-0.9,0.9)),
                             nrow=100,ncol=10)
mat <- mat[sample(1:10000,10000,replace=FALSE),sample(1:50,50,replace=FALSE)]


# Computing gap statistic for initial 1381 BC takes approx. 15 min.
# Gap Statistic chooses 4 clusters. 
out <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2) 
summary(out$Biclust)

# Reduce computation by selecting number of clusters manually.
# Note: The "ClusterRowCoverage" function can be used to provide extra info 
#       on the choice of the number of clusters.
#       How?
#       - More clusters result in smaller column patterns and more matching rows.
#       - Fewer clusters result in larger column patterns and fewer matching rows.
# Step 1: Initial Workflow Run
out2 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=10)
# Step 2: Use ClusterRowCoverage
temp <- ClusterRowCoverage(result=out2,matrix=mat,noise=0.2,plots=2)
# Step 3: Use BiBitWorkflow again (using previously computed parts) with new cut parameter
out3 <- BiBitWorkflow(matrix=mat,minr=50,minc=5,noise=0.2,cut_type="number",cut_pm=4,
                      BCresult = out2$info$BiclustInitial,
                      simmatresult = out2$info$BiclustSimInitial)
summary(out3$Biclust)

## End(Not run)

[Package BiBitR version 0.3.1 Index]