R: Extract elements within a genomic window

subtrack {Rgb}

R Documentation

Extract elements within a genomic window

Description

subtrack extracts rows from a data.frame, list or vector collection that fall within a single genomic window, defined by a chromosome name, a starting position and an ending position. As this is a common task in genome-wide analysis, the function relies on optimized C code to achieve good performance.

sizetrack is very similar to subtrack, but only counts the rows without extracting the data.

istrack checks whether a collection of data is suitable for subtrack and sizetrack (see 'Track definition' for further details). As this operation is quite expensive and only needs to be performed once, it is up to users to check their data before calling subtrack or sizetrack.

Usage

  istrack(...)
  subtrack(...)
  sizetrack(...)

Arguments

...

A collection of data to be considered as a single track. Named vectors are considered as single columns, data.frame and list as collections of columns, all combined side by side into a single two-dimensional table (assuming they all have the same length / row count). See 'Track definition' for further details.

For subtrack and sizetrack, the first four arguments must be the following, in this order (preferably unnamed):

chromosome location (integer or character, according to 'chrom' type)
starting position on the chromosome (integer, included in the window)
ending position on the chromosome (integer, included in the window)
chromosome index (integer vector, see 'Chromosome index' below)

Details

The C code relies heavily on the row ordering to quickly retrieve the elements that overlap the queried window. Elements entirely contained in the window are returned, as well as elements that only partially overlap it.
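The overlap rule described above can be written in plain R, which may help when checking results by hand. The sketch below uses made-up coordinates and a hypothetical helper named overlapRows; it is not the package's C implementation and does not use the chromosome index.

```r
# Toy track (made-up coordinates): four features on chromosome 1, one on chromosome 2
track <- data.frame(
  chrom = c(1L, 1L, 1L, 1L, 2L),
  start = c(100L, 450L, 650L, 900L, 600L),
  end   = c(200L, 600L, 700L, 950L, 800L)
)

# Hypothetical helper: a feature is kept when it starts no later than the
# window end and ends no earlier than the window start (bounds included)
overlapRows <- function(track, chrom, start, end) {
  track[ track$chrom == chrom & track$start <= end & track$end >= start , ]
}

# Window 500-920 on chromosome 1: one feature overlaps on the left,
# one is fully contained, one overlaps on the right
hits <- overlapRows(track, 1L, 500L, 920L)
print(hits)
```

The feature at 100-200 is excluded because it ends before the window starts, and the chromosome 2 feature is excluded despite compatible coordinates.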

Value

subtrack returns a single data.frame merging all columns provided, with the subset of rows corresponding to elements in the queried window. This data.frame has no row names, and is a valid track (see 'Track definition' for further details).

sizetrack returns a single integer value corresponding to the count of rows in the queried window.

istrack returns a single TRUE value if the data collection provided is a valid track. Otherwise it returns a single FALSE value, with a "why" attribute containing a single character string explaining the (first) condition that is not fulfilled.
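The return convention described for istrack (a bare TRUE, or FALSE carrying a "why" attribute) can be consumed with isTRUE and attr. The sketch below uses a hypothetical stand-in named checkSorted, which is not part of Rgb and only mimics that convention so the snippet runs without the package.

```r
# Hypothetical checker, NOT the Rgb implementation: it only mimics the
# "TRUE, or FALSE with a 'why' attribute" return convention
checkSorted <- function(x) {
  if(is.unsorted(x)) {
    out <- FALSE
    attr(out, "why") <- "values are not sorted"
    return(out)
  }
  TRUE
}

# Caller side: test the result, then read the explanation if it failed
res <- checkSorted(c(3L, 1L, 2L))
if(!isTRUE(res)) message("Rejected: ", attr(res, "why"))
```

Using isTRUE rather than comparing with identical(res, TRUE) alone keeps the test readable, and attr(res, "why") is NULL when the check succeeds.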

Track definition

A track is defined as a data.frame holding a variable number of data columns about a variable number of features (one feature per row).

Three columns are mandatory, with fixed names and types:

chrom: The chromosomal location of the feature, as integer or factor.
start: The starting position of the feature on the chromosome, as integer.
end: The ending position of the feature on the chromosome, as integer.

The track must be ordered by chromosome, then by starting position. When chromosomes are stored as factors, they need to be ordered numerically by their internal codes (as the order function does), not alphabetically by their labels.
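The difference between ordering a factor by its internal codes and ordering its labels as text can be seen on a small made-up example. The chromosome names and level order below are assumptions chosen for illustration only.

```r
# Made-up chromosome names; levels are declared in the intended order
chrom <- factor(c("10", "2", "X", "1", "2"), levels = c("1", "2", "10", "X"))
start <- c(500L, 300L, 100L, 900L, 50L)

# order() on a factor sorts by internal codes, following the level order
byCodes <- order(chrom, start)
print(as.character(chrom[byCodes]))

# Sorting the labels as character strings places "10" before "2"
# (method = "radix" forces C-locale collation, so the result is reproducible)
byLabels <- order(as.character(chrom), start, method = "radix")
print(as.character(chrom[byLabels]))
```

Only the first ordering matches the requirement stated above for factor chromosomes.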

Chromosome index

To guarantee good performance, chromosomes must be indexed. As the rows are expected to be ordered by chromosome, then by starting position (see 'Track definition'), recording where the rows of each chromosome end can save a large amount of computation time in large tracks.

The following specifications must be fulfilled:

The index must be an integer vector holding, for each chromosome, the index of its last row in the track.
Values must be ordered by chromosome, in the same way as the 'chrom' column.
For integer 'chrom', values are extracted by position (chromosome '1' is the first value ...).
For factor 'chrom', values are extracted by name (the index is named with the 'chrom' levels).
Chromosomes without data in the track must still be present in the index, with NA integer values.

See the 'Examples' section below for index computation.
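As a small self-contained illustration of these specifications, the sketch below builds an index for a made-up factor 'chrom' column in which one level has no rows, then shows how such an index delimits the rows of one chromosome. The helper chromRange is hypothetical and is not part of Rgb; it only illustrates why the last-row index narrows the search.

```r
# Made-up, already ordered 'chrom' column; level "3" has no feature
chrom <- factor(c("1", "1", "2", "4", "4", "4"), levels = c("1", "2", "3", "4"))

# Last row of each chromosome, named by factor level; empty levels yield NA
index <- tapply(seq_along(chrom), chrom, max)
print(index)

# Hypothetical helper: first and last row of the k-th chromosome,
# derived from the last rows of the chromosomes that precede it
chromRange <- function(index, k) {
  previous <- index[ seq_len(k - 1L) ]
  previous <- previous[ !is.na(previous) ]
  first <- if(length(previous) > 0L) max(previous) + 1L else 1L
  c(first = first, last = index[[k]])
}

# Rows of chromosome "4" (4th level), skipping the empty chromosome "3"
print(chromRange(index, 4L))
```

With such a range in hand, only the rows of the queried chromosome need to be scanned, rather than the whole track.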

Note

These three functions are provided for generic use on data.frame, list or vector objects. The track.table class implements more suitable slice, size and check methods, and handles the indexing automatically.

Author(s)

Sylvain Mareschal

Examples

  
  # Example data: subset of human genes
  data(hsGenes)
  
  # Track validity, checked before and after ordering by chromosome then start
  print(istrack(hsGenes))
  hsGenes <- hsGenes[ order(hsGenes$chrom, hsGenes$start) ,]
  print(istrack(hsGenes))
  
  # Chromosome index (factor 'chrom'): last row of each chromosome, named by level
  index <- tapply(seq_len(nrow(hsGenes)), hsGenes$chrom, max)
  
  # Factor chrom query
  print(class(hsGenes$chrom))
  subtrack("1", 10e6, 15e6, index, hsGenes)
  
  # Row count
  a <- nrow(subtrack("1", 10e6, 15e6, index, hsGenes))
  b <- sizetrack("1", 10e6, 15e6, index, hsGenes)
  if(a != b) stop("Inconsistency")
  
  # Multiple sources
  length <- hsGenes$end - hsGenes$start
  subtrack("1", 10e6, 15e6, index, hsGenes, length)
  subtrack("1", 10e6, 15e6, index, hsGenes, length=length)
  
  # Speed comparison: 40000 subtrack calls versus 200 data.frame subsets (x200 here)
  system.time(
    for(i in 1:40000) {
      subtrack("1", 10e6, 15e6, index, hsGenes)
    }
  )
  system.time(
    for(i in 1:200) {
      hsGenes[ hsGenes$chrom == "1" & hsGenes$start <= 15e6 & hsGenes$end >= 10e6 ,]
    }
  )
  
  # Convert chrom from factor to integer
  hsGenes$chrom <- as.integer(as.character(hsGenes$chrom))
  
  # Chromosome index (integer 'chrom'): one slot per chromosome, NA where there is no data
  index <- rep(NA_integer_, 24)
  tmpIndex <- tapply(seq_len(nrow(hsGenes)), hsGenes$chrom, max)
  index[ as.integer(names(tmpIndex)) ] <- tmpIndex
  
  # Integer chrom query
  print(class(hsGenes$chrom))
  subtrack(1, 10e6, 15e6, index, hsGenes)

[Package Rgb version 1.7.5 Index]