TopDom {TopDom} | R Documentation |
Identify Topological Domains from a Hi-C Contact Matrix
Description
Identify Topological Domains from a Hi-C Contact Matrix
Usage
TopDom(
data,
window.size,
outFile = NULL,
statFilter = TRUE,
...,
debug = getOption("TopDom.debug", FALSE)
)
Arguments
data |
A TopDomData object, or the pathname to a normalized
Hi-C contact matrix file as read by readHiC(). |
window.size |
The number of bins to extend (as a non-negative integer). Recommended range is in 5, ..., 20. |
outFile |
(optional) The filename without extension of the three result files optionally produced. See details below. |
statFilter |
(logical) Specifies whether non-significant topological-domain boundaries should be dropped or not. |
... |
Additional arguments passed to readHiC(). |
debug |
If TRUE, debug output is produced. |
Value
A named list of class TopDom with data.frame elements binSignal, domain, and bed.
The binSignal data frame (N-by-7) holds the mean contact frequency, local extreme, and p-value for every bin. The first four columns hold basic bin information given by the matrix file: bin ID (id), chromosome (chr), start coordinate (from.coord), and end coordinate (to.coord) of each bin. The last three columns (local.ext, mean.cf, and p-value) hold values computed by the TopDom algorithm. The columns are:
- id: Bin ID
- chr: Chromosome
- from.coord: Start coordinate of bin
- to.coord: End coordinate of bin
- local.ext:
  - -1: Local minimum.
  - -0.5: Gap region.
  - 0: General bin.
  - 1: Local maximum.
- mean.cf: Average of contact frequencies between lower and upper regions for bin i = 1,2,...,N.
- p-value: p-value computed by the Wilcoxon rank-sum test. See Shin et al. (2016) for more details.
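As an illustration of how the binSignal columns could be used, the following sketch picks out bins that are local minima with a small p-value. The data frame is a hand-made mock, not TopDom output. The column name p-value is taken from the description above and the 0.05 threshold is an assumption; check names(fit$binSignal) on a real result before relying on either.

```r
## Mock binSignal-like data frame (hypothetical values, not TopDom output)
binSignal <- data.frame(
  id         = 1:6,
  chr        = "chr19",
  from.coord = (0:5) * 40000,
  to.coord   = (1:6) * 40000,
  local.ext  = c(0, -1, 0, 1, -1, -0.5),
  mean.cf    = c(12.1, 3.4, 9.8, 15.2, 4.0, 0),
  `p-value`  = c(0.80, 0.01, 0.55, 0.90, 0.20, 1),
  check.names = FALSE  ## keep the hyphen in the column name
)

## Candidate boundaries: local minima (local.ext == -1) ...
is_min <- binSignal$local.ext == -1
## ... that also pass an assumed significance threshold of 0.05
is_sig <- binSignal[["p-value"]] < 0.05

boundaries <- binSignal[is_min & is_sig, ]
print(boundaries)
```

Because the column name contains a hyphen, it is accessed with [["p-value"]] rather than with the $ operator.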
The domain data frame (D-by-7) classifies every bin into a basic building block: gap, domain, or boundary. Each row describes one block. The first five columns hold basic information about the block; the tag column gives the class of the block.
- id: Identifier of block
- chr: Chromosome
- from.id: Start bin index of the block
- from.coord: Start coordinate of the block
- to.id: End bin index of the block
- to.coord: End coordinate of the block
- tag: Category of the block. Three block types exist:
  - gap
  - domain
  - boundary
- size: Size of the block
The bed data frame (D-by-4) is a representation of the domain data frame in the BED file format. It has four columns:
- chrom: The name of the chromosome.
- chromStart: The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
- chromEnd: The ending position of the feature in the chromosome. The chromEnd base is not included in the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name: Defines the name of the BED line. This label is displayed to the left of the BED line in the UCSC Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
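The half-open BED convention described above (zero-based chromStart, excluded chromEnd) can be checked with a little arithmetic. The sketch below uses a mock data frame with made-up coordinates; the added columns width, first.base, and last.base are for illustration only and are not part of the bed element.

```r
## Mock BED-like data frame (hypothetical coordinates)
bed <- data.frame(
  chrom      = "chr19",
  chromStart = c(0L, 100L),
  chromEnd   = c(100L, 250L),
  name       = c("first", "second")
)

## With half-open intervals, the width is simply end minus start
bed$width <- bed$chromEnd - bed$chromStart

## Zero-based index of the first and the last base covered by each feature
bed$first.base <- bed$chromStart
bed$last.base  <- bed$chromEnd - 1L

print(bed)
```

The first row reproduces the example in the text: chromStart=0 and chromEnd=100 cover 100 bases, numbered 0 through 99.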
If argument outFile is non-NULL, then the three elements (binSignal, domain, and bed) returned are also written to tab-delimited files
with file names ‘<outFile>.binSignal’, ‘<outFile>.domain’, and
‘<outFile>.bed’, respectively. None of the files have row names,
and all but the BED file have column names.
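To make the file layout concrete, the following sketch writes and re-reads two files in the format described above, using base R only. It is not the package's own writer. The mock data frames, the file prefix, and the choice of write.table() arguments are assumptions chosen to match the description (tab-delimited, no row names, column names everywhere except in the BED file).

```r
## Mock data frames (hypothetical values); integers avoid output like "4e+05"
domain <- data.frame(chr = "chr19", tag = c("gap", "domain"),
                     size = c(400000L, 800000L))
bed <- data.frame(chrom = "chr19",
                  chromStart = c(0L, 400000L),
                  chromEnd   = c(400000L, 1200000L),
                  name       = c("gap", "domain"))

## Hypothetical file prefix playing the role of 'outFile'
outFile <- file.path(tempdir(), "example")

## '<outFile>.domain': tab-delimited, with column names, without row names
write.table(domain, file = paste0(outFile, ".domain"), sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = TRUE)

## '<outFile>.bed': tab-delimited, without column names and row names
write.table(bed, file = paste0(outFile, ".bed"), sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

## Reading the files back: header = TRUE for all but the BED file
domain2 <- read.table(paste0(outFile, ".domain"), sep = "\t",
                      header = TRUE, stringsAsFactors = FALSE)
bed2 <- read.table(paste0(outFile, ".bed"), sep = "\t", header = FALSE,
                   col.names = c("chrom", "chromStart", "chromEnd", "name"),
                   stringsAsFactors = FALSE)
bed_lines <- readLines(paste0(outFile, ".bed"))
```

Since the BED file carries no header, the column names have to be supplied when reading it back.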
Window size
The window.size parameter is by design the only tuning parameter in the
TopDom method and affects the amount of smoothing applied when calculating
the TopDom bin signals. The binning window extends symmetrically downstream
and upstream from the bin such that the bin signal is the average of
window.size^2 contact frequencies.
For details, see Equation (1) and Figure 1 in Shin et al. (2016).
Typically, the number of identified TDs decreases while their average
lengths increase as this window-size parameter increases (Figure 2).
The default is window.size = 5 (bins), which is motivated as:
"Considering the previously reported minimum TD size (approx. 200 kb)
(Dixon et al., 2012) and our bin size of 40 kb, w[indow.size] = 5 is a
reasonable setting" (Shin et al., 2016).
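The averaging over window.size^2 contact frequencies can be sketched in a few lines of base R. The function bin_signal() below is a hypothetical helper written for this illustration and is not part of the TopDom package. For bin i it averages the contacts between the window.size bins ending at i and the window.size bins starting at i + 1. Shrinking the window at the matrix edges is an assumption of this sketch.

```r
## Hypothetical helper, not part of the TopDom package
bin_signal <- function(mat, window.size) {
  n <- nrow(mat)
  stopifnot(is.matrix(mat), n == ncol(mat), window.size >= 1)
  signal <- rep(NA_real_, times = n)
  for (i in seq_len(n - 1L)) {
    ## Bins at and before bin i (window shrinks near the first bin)
    upstream <- max(1L, i - window.size + 1L):i
    ## Bins after bin i (window shrinks near the last bin)
    downstream <- (i + 1L):min(n, i + window.size)
    ## Average of up to window.size^2 contact frequencies
    signal[i] <- mean(mat[upstream, downstream])
  }
  signal  ## the last bin has no downstream bins and stays NA
}

## Toy symmetric contact matrix with two dense blocks of four bins each
mat <- matrix(1, nrow = 8, ncol = 8)
mat[1:4, 1:4] <- 10
mat[5:8, 5:8] <- 10

signal <- bin_signal(mat, window.size = 2L)
print(signal)
```

In this toy matrix the signal is lowest at bin 4, the last bin of the first block, which is where the two blocks meet. A larger window.size averages over more contacts and therefore smooths the signal more.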
Author(s)
Hanjun Shin, Harris Lazaris, and Gangqing Hu. R package, help, and code refactoring by Henrik Bengtsson.
References
Shin et al., TopDom: an efficient and deterministic method for identifying topological domains in genomes, Nucleic Acids Research, 44(7): e70, April 2016. DOI: 10.1093/nar/gkv1505, PMCID: PMC4838359, PMID: 26704975
Shin et al., R script ‘TopDom_v0.0.2.R’, 2017 (originally from http://zhoulab.usc.edu/TopDom/; later available on https://github.com/jasminezhoulab/TopDom via https://zhoulab.dgsom.ucla.edu/pages/software)
Shin et al., TopDom Manual, 2016-07-08 (originally from http://zhoulab.usc.edu/TopDom/TopDom%20Manual_v0.0.2.pdf; later available on https://github.com/jasminezhoulab/TopDom via https://zhoulab.dgsom.ucla.edu/pages/software)
Hanjun Shin, Understanding the 3D genome organization in topological domain level, Doctor of Philosophy Dissertation, University of Southern California, March 2017, http://digitallibrary.usc.edu/cdm/ref/collection/p15799coll40/id/347735
Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature; 485(7398):376-80, April 2012. DOI: 10.1038/nature11082, PMCID: PMC3356448, PMID: 22495300.
Examples
path <- system.file("exdata", package = "TopDom", mustWork = TRUE)
## Original count data (on a subset of the bins to speed up example)
chr <- "chr19"
pathname <- file.path(path, sprintf("nij.%s.gz", chr))
data <- readHiC(pathname, chr = chr, binSize = 40e3, bins = 1:500)
print(data) ## a TopDomData object
## Find topological domains using the TopDom method
fit <- TopDom(data, window.size = 5L)
print(fit) ## a TopDom object
## Display the largest domain
td <- subset(subset(fit$domain, tag == "domain"), size == max(size))
print(td) ## a data.frame
## Subset TopDomData object
data_s <- subsetByRegion(data, region = td, margin = 0.9999)
print(data_s) ## a TopDomData object
vp <- grid::viewport(angle = -45, width = 0.7, y = 0.3)
gg <- ggCountHeatmap(data_s)
gg <- gg + ggDomain(td, color = "#cccc00") + ggDomainLabel(td)
print(gg, newpage = TRUE, vp = vp)
gg <- ggCountHeatmap(data_s, colors = list(mid = "white", high = "black"))
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
print(gg, newpage = TRUE, vp = vp)
## Subset TopDom object
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
print(fit_s) ## a TopDom object
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk %% 2), color = "red", size = 1)
}
print(gg, newpage = TRUE, vp = vp)
gg <- ggCountHeatmap(data_s)
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk %% 2), color = "blue", size = 1)
}
print(gg, newpage = TRUE, vp = vp)