R: Identify Topological Domains from a Hi-C Contact Matrix

TopDom {TopDom}

R Documentation

Identify Topological Domains from a Hi-C Contact Matrix

Description

Identify Topological Domains from a Hi-C Contact Matrix

Usage

TopDom(
  data,
  window.size,
  outFile = NULL,
  statFilter = TRUE,
  ...,
  debug = getOption("TopDom.debug", FALSE)
)

Arguments

`data`	A TopDomData object, or the pathname to a normalized Hi-C contact matrix file as read by `readHiC()`, that specifies N bins.
`window.size`	The number of bins to extend (as a non-negative integer). The recommended range is 5, ..., 20.
`outFile`	(optional) The filename without extension of the three result files optionally produced. See details below.
`statFilter`	(logical) Specifies whether non-significant topological-domain boundaries should be dropped or not.
`...`	Additional arguments passed to `readHiC()`.
`debug`	If `TRUE`, debug output is produced.
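
As the Usage section shows, the default for debug is looked up from the R option TopDom.debug. A minimal base-R sketch of toggling that option for a session follows; the variable names (before, old_opts, during, after) are illustrative only and not part of the package.

```r
## Value seen by TopDom() when the option is unset: falls back to FALSE
before <- getOption("TopDom.debug", FALSE)

## Enable debug output for subsequent calls in this session
old_opts <- options(TopDom.debug = TRUE)
during <- getOption("TopDom.debug", FALSE)

## Restore the previous setting
options(old_opts)
after <- getOption("TopDom.debug", FALSE)
```

Passing debug = TRUE directly in a call would affect only that call, whereas the option applies to every call that relies on the default.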

Value

A named list of class TopDom with data.frame elements binSignal, domain, and bed.

The binSignal data frame (N-by-7) holds the mean contact frequency, local extreme, and p-value for every bin. The first four columns hold basic bin information given by the matrix file: bin ID (id), chromosome (chr), start coordinate (from.coord), and end coordinate (to.coord). The last three columns (local.ext, mean.cf, and p-value) hold values computed by the TopDom algorithm. The columns are:
- id: Bin ID
- chr: Chromosome
- from.coord: Start coordinate of bin
- to.coord: End coordinate of bin
- local.ext:
  - -1: Local minimum.
  - -0.5: Gap region.
  - 0: General bin.
  - 1: Local maximum.
- mean.cf: Average of contact frequencies between lower and upper regions for bin i = 1,2,...,N.
- p-value: p-value computed by the Wilcoxon rank-sum test. See Shin et al. (2016) for more details.
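
The four local.ext codes listed above can be turned into readable labels with a factor. The sketch below uses a made-up vector of codes rather than real TopDom output, and the label strings are illustrative choices.

```r
## Toy local.ext values (made up for illustration, not TopDom output)
local_ext <- c(0, 0, -1, 0, 1, -0.5, -0.5, 0, -1)

## Map the four documented codes to readable labels
ext_label <- factor(
  local_ext,
  levels = c(-1, -0.5, 0, 1),
  labels = c("local minimum", "gap", "general", "local maximum")
)

## Number of bins per category
code_counts <- table(ext_label)
print(code_counts)

## Indices of the bins flagged as local minima
minima <- which(local_ext == -1)
```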
The domain data frame (D-by-7) categorizes every bin into a basic building block: gap, domain, or boundary. Each row describes one block. The first five columns hold basic information about the block, and the tag column indicates the class of the building block.
- id: Identifier of block
- chr: Chromosome
- from.id: Start bin index of the block
- from.coord: Start coordinate of the block
- to.id: End bin index of the block
- to.coord: End coordinate of the block
- tag: Categorized name of the block. Three block types exist:
  - gap
  - domain
  - boundary
- size: Size of the block
The bed data frame (D-by-4) is a representation of the domain data frame in the BED file format. It has four columns:
- chrom: The name of the chromosome.
- chromStart: The starting position of the feature in the chromosome. The first base in a chromosome is numbered 0.
- chromEnd: The ending position of the feature in the chromosome. The chromEnd base is not included in the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
- name: Defines the name of the BED line. This label is displayed to the left of the BED line in the UCSC Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
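
The half-open, zero-based coordinate convention described for chromStart and chromEnd can be checked with a few lines of base R. The rows below are made-up coordinates, not TopDom output, and the length column is added purely for illustration.

```r
## Toy BED-style rows (made-up coordinates, not TopDom output)
bed <- data.frame(
  chrom = "chr19",
  chromStart = c(0, 100),
  chromEnd = c(100, 250),
  name = c("domain", "boundary")
)

## With half-open intervals, feature length is simply end minus start
bed$length <- bed$chromEnd - bed$chromStart

## Zero-based positions covered by the first feature: 0, 1, ..., 99
first_bases <- seq(bed$chromStart[1], bed$chromEnd[1] - 1)

## Adjacent features share a coordinate without overlapping
adjacent <- bed$chromEnd[1] == bed$chromStart[2]
```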

If argument outFile is non-NULL, then the three elements (binSignal, domain, and bed) returned are also written to tab-delimited files with file names ‘<outFile>.binSignal’, ‘<outFile>.domain’, and ‘<outFile>.bed’, respectively. None of the files have row names, and all but the BED file have column names.
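
To illustrate the file layout just described (tab-delimited, no row names, column names in all files except the BED file), here is a base-R round trip on toy data. The data frames and the prefix are hypothetical stand-ins, not output written by TopDom(); the point is only which header setting matches which file when reading back.

```r
## Toy data frames standing in for the returned elements (not TopDom output)
domain <- data.frame(
  chr = "chr19",
  from.coord = c(0L, 400000L),
  to.coord = c(400000L, 520000L),
  tag = c("domain", "gap")
)
bed <- data.frame(
  chrom = domain$chr, chromStart = domain$from.coord,
  chromEnd = domain$to.coord, name = domain$tag
)

## Hypothetical prefix, playing the role of 'outFile'
prefix <- file.path(tempdir(), "toy")

## Tab-delimited, no row names; header for .domain, none for .bed
write.table(domain, paste0(prefix, ".domain"), sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = TRUE)
write.table(bed, paste0(prefix, ".bed"), sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

## Reading back: header = TRUE for .domain, header = FALSE for .bed
domain_in <- read.table(paste0(prefix, ".domain"), sep = "\t",
                        header = TRUE, stringsAsFactors = FALSE)
bed_in <- read.table(paste0(prefix, ".bed"), sep = "\t",
                     header = FALSE, stringsAsFactors = FALSE,
                     col.names = c("chrom", "chromStart", "chromEnd", "name"))
```

Because the BED file carries no header line, the column names have to be supplied by the reader, as done with col.names above.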

Window size

The window.size parameter is by design the only tuning parameter in the TopDom method and controls the amount of smoothing applied when calculating the TopDom bin signals. The binning window extends symmetrically downstream and upstream from the bin such that the bin signal is the average of window.size^2 contact frequencies. For details, see Equation (1) and Figure 1 in Shin et al. (2016). Typically, the number of identified TDs decreases while their average length increases as this window-size parameter increases (Figure 2). Note that TopDom() has no built-in default for window.size (see Usage). Shin et al. (2016) motivate window.size = 5 (bins) as: "Considering the previously reported minimum TD size (approx. 200 kb) (Dixon et al., 2012) and our bin size of 40 kb, w[indow.size] = 5 is a reasonable setting" (Shin et al., 2016).
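
The averaging described above can be sketched in base R on a toy matrix. This is an illustration of the idea only, not the package's implementation: the helper name bin_signal is hypothetical, the toy counts are made up, and truncating the window at the matrix edges is an assumption made here for simplicity.

```r
## Toy symmetric contact matrix: two blocks of 4 bins with high
## within-block counts (10) and low between-block counts (1)
n <- 8L
counts <- matrix(1, nrow = n, ncol = n)
counts[1:4, 1:4] <- 10
counts[5:8, 5:8] <- 10

## Hypothetical helper: mean contact frequency between the (up to) w bins
## ending at bin i and the (up to) w bins that follow bin i
bin_signal <- function(counts, i, w) {
  n <- nrow(counts)
  if (i >= n) return(NA_real_)            # last bin has no downstream bins
  up <- max(1L, i - w + 1L):i             # upstream bins, including bin i
  down <- (i + 1L):min(n, i + w)          # downstream bins
  mean(counts[up, down, drop = FALSE])    # average of up to w^2 entries
}

signal <- vapply(seq_len(n), function(i) bin_signal(counts, i, w = 2L),
                 numeric(1))
print(signal)

## The signal dips where the upstream and downstream windows fall in
## different blocks, here at the last bin of the first block
boundary_bin <- which.min(signal)
```

With a larger w, more bins around the block transition mix within-block and between-block counts, which is one way to see why larger windows smooth the signal.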

Author(s)

Hanjun Shin, Harris Lazaris, and Gangqing Hu. R package, help, and code refactoring by Henrik Bengtsson.

References

Shin et al., TopDom: an efficient and deterministic method for identifying topological domains in genomes, Nucleic Acids Research, 44(7): e70, April 2016. DOI: 10.1093/nar/gkv1505, PMCID: PMC4838359, PMID: 26704975
Shin et al., R script ‘TopDom_v0.0.2.R’, 2017 (originally from http://zhoulab.usc.edu/TopDom/; later available on https://github.com/jasminezhoulab/TopDom via https://zhoulab.dgsom.ucla.edu/pages/software)
Shin et al., TopDom Manual, 2016-07-08 (originally from http://zhoulab.usc.edu/TopDom/TopDom%20Manual_v0.0.2.pdf; later available on https://github.com/jasminezhoulab/TopDom via https://zhoulab.dgsom.ucla.edu/pages/software)
Hanjun Shin, Understanding the 3D genome organization in topological domain level, Doctor of Philosophy Dissertation, University of Southern California, March 2017, http://digitallibrary.usc.edu/cdm/ref/collection/p15799coll40/id/347735
Dixon JR, Selvaraj S, Yue F, Kim A, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature; 485(7398):376-80, April 2012. DOI: 10.1038/nature11082, PMCID: PMC3356448, PMID: 22495300.

Examples

path <- system.file("exdata", package = "TopDom", mustWork = TRUE)

## Original count data (on a subset of the bins to speed up example)
chr <- "chr19"
pathname <- file.path(path, sprintf("nij.%s.gz", chr))
data <- readHiC(pathname, chr = chr, binSize = 40e3, bins = 1:500)
print(data)  ## a TopDomData object

## Find topological domains using the TopDom method
fit <- TopDom(data, window.size = 5L)
print(fit)  ## a TopDom object

## Display the largest domain
td <- subset(subset(fit$domain, tag == "domain"), size == max(size))
print(td) ## a data.frame

## Subset TopDomData object
data_s <- subsetByRegion(data, region = td, margin = 0.9999)
print(data_s)  ## a TopDomData object

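## Plot the count heatmap of the subset in a rotated viewport,
## with the largest domain marked and labeled
vp <- grid::viewport(angle = -45, width = 0.7, y = 0.3)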
gg <- ggCountHeatmap(data_s)
gg <- gg + ggDomain(td, color = "#cccc00") + ggDomainLabel(td)
print(gg, newpage = TRUE, vp = vp)

## Same heatmap on a white-to-black color scale
gg <- ggCountHeatmap(data_s, colors = list(mid = "white", high = "black"))
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
print(gg, newpage = TRUE, vp = vp)

## Subset TopDom object
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
print(fit_s)  ## a TopDom object
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk %% 2), color = "red", size = 1)
}

print(gg, newpage = TRUE, vp = vp)


## Same overlay on the default color scale, with domains drawn in blue
gg <- ggCountHeatmap(data_s)
gg_td <- ggDomain(td, delta = 0.08)
dx <- attr(gg_td, "gg_params")$dx
gg <- gg + gg_td + ggDomainLabel(td, vjust = 2.5)
fit_s <- subsetByRegion(fit, region = td, margin = 0.9999)
for (kk in seq_len(nrow(fit_s$domain))) {
  gg <- gg + ggDomain(fit_s$domain[kk, ], dx = dx * (4 + kk %% 2), color = "blue", size = 1)
}

print(gg, newpage = TRUE, vp = vp)

[Package TopDom version 0.10.1 Index]