subtrack {Rgb} | R Documentation |
Extract elements within a genomic window
Description
subtrack extracts rows from a data.frame, list or vector collection within a single genomic window, defined by a chromosome name, a starting position and an ending position. As this is a common task in genome-wide analysis, this function relies on optimized C code to achieve good performance.
sizetrack is very similar to subtrack, but only counts rows without extracting the data.
istrack checks whether a collection of data is suitable for subtrack and sizetrack (see 'Track definition' for further details). As this operation is quite expensive and only needs to be performed once, it is up to the user to check their data before calling subtrack or sizetrack.
Usage
istrack(...)
subtrack(...)
sizetrack(...)
Arguments
... |
A collection of data to be considered as a single track. Named vectors are considered as single columns, For
|
Details
The C code relies heavily on the row ordering to quickly retrieve the elements that overlap the queried window. Elements entirely contained in the window are returned, as well as elements that only partially overlap it.
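The overlap rule described above can be sketched in base R. This is only an illustration of the selection criterion on a toy data.frame, not the package's C implementation; the `toy` data and the `in_window` helper are hypothetical names introduced here.

```r
# Toy track: three features on chromosome "1" (hypothetical data)
toy <- data.frame(
  chrom = factor(c("1", "1", "1")),
  start = c(5L, 12L, 30L),
  end   = c(11L, 18L, 40L)
)

# Hypothetical helper: TRUE for rows overlapping the window [from, to] on 'chrom'
in_window <- function(track, chrom, from, to) {
  track$chrom == chrom & track$start <= to & track$end >= from
}

# Window 10..20: the first feature overlaps it partially,
# the second is fully contained, the third lies outside
hits <- toy[in_window(toy, "1", 10L, 20L), ]
print(hits)
```

Unlike this full scan, the package relies on the ordering and the chromosome index to avoid testing every row.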
Value
subtrack returns a single data.frame merging all columns provided, with the subset of rows corresponding to elements in the queried window. This data.frame has no row names, and is a valid track (see 'Track definition' for further details).
sizetrack returns a single integer value corresponding to the count of rows in the queried window.
istrack returns a single TRUE value if the data collection provided is a valid track. Otherwise it returns a single FALSE value, with a "why" attribute containing a single character string explaining the (first) condition that is not fulfilled.
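The return convention of istrack can be handled as sketched below. To stay independent of the package, `res` is built by hand with `structure()` to mimic a failed check; the message text is invented for illustration and is not one of the package's actual messages.

```r
# Hand-built stand-in for a failed istrack() result (message text is hypothetical)
res <- structure(FALSE, why = "'start' column must be integer")

# isTRUE() is FALSE for a failed check, so it is a safe way to test the outcome
if (!isTRUE(res)) {
  reason <- attr(res, "why")
  message("Not a valid track: ", reason)
}

# identical(res, FALSE) is FALSE here: the "why" attribute makes the objects differ
strict <- identical(res, FALSE)
```

Testing the result with `isTRUE()` rather than `identical()` avoids being misled by the attached attribute.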
Track definition
A track is defined as a data.frame with a variable amount of data (in columns) about a variable number of features (in rows).
Three columns are mandatory, with restricted names and types:
- chrom
The chromosomal location of the feature, as integer or factor.
- start
The starting position of the feature on the chromosome, as integer.
- end
The ending position of the feature on the chromosome, as integer.
The track is expected to be ordered by chromosome, then by starting position. When chromosomes are stored as factors, they need to be ordered numerically by their internal codes (as the order function does), not alphabetically by their labels.
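The difference between the two orderings can be seen on a small factor. The chromosome names and level order below are chosen for illustration only.

```r
# Chromosome names stored as a factor; the level order defines the internal codes
chrom <- factor(c("10", "2", "X", "1"), levels = c("1", "2", "10", "X"))

# order() on a factor sorts by internal integer codes
by_codes <- as.character(chrom[order(chrom)])

# Sorting the labels as character strings gives a different order
by_labels <- sort(as.character(chrom), method = "radix")

print(by_codes)
print(by_labels)
```

Only the first ordering (by internal codes) matches what the track definition requires.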
Chromosome index
To guarantee good performance, chromosomes must be indexed. As the rows are expected to be ordered by chromosome, then by starting position (see 'Track definition'), remembering the starting or ending row of each chromosome can save a large amount of computation time in large tracks.
The index must fulfill the following specifications:
- It must be an integer vector, holding the last row index of each chromosome in the indexed track.
- Values are to be ordered by chromosome, in the same way as the 'chrom' column.
- For integer 'chrom', values are extracted by position (chromosome '1' is the first value ...).
- For factor 'chrom', values are extracted by names (named with 'chrom' levels).
- Chromosomes without data in the track must be described, with NA integer values.
See the 'Examples' section below for index computation.
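As a complement, here is a minimal sketch of an index for a factor 'chrom' in which one chromosome has no data. The `toy` track is hypothetical; the `tapply()` approach is the one used in the 'Examples' section.

```r
# Toy ordered track with a factor 'chrom'; level "2" has no feature (hypothetical data)
toy <- data.frame(
  chrom = factor(c("1", "1", "3"), levels = c("1", "2", "3")),
  start = c(10L, 50L, 5L),
  end   = c(20L, 60L, 9L)
)

# Last row index of each chromosome, named with the 'chrom' levels;
# tapply() leaves levels without any row as NA
index <- tapply(seq_len(nrow(toy)), toy$chrom, max)
print(index)
```

The result is integer, named by level, and keeps an NA entry for the empty chromosome, which is what the specifications above ask for.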
Note
These three functions are provided for generic usage on data.frame, list or vector objects. The track.table class implements more suitable slice, size and check methods, and handles the indexing autonomously.
Author(s)
Sylvain Mareschal
Examples
# Example data: subset of human genes
data(hsGenes)
# Track validity
print(istrack(hsGenes))
hsGenes <- hsGenes[ order(hsGenes$chrom, hsGenes$start) ,]
print(istrack(hsGenes))
# Chromosome index (factor 'chrom')
index <- tapply(1:nrow(hsGenes), hsGenes$chrom, max)
# Factor chrom query
print(class(hsGenes$chrom))
subtrack("1", 10e6, 15e6, index, hsGenes)
# Row count
a <- nrow(subtrack("1", 10e6, 15e6, index, hsGenes))
b <- sizetrack("1", 10e6, 15e6, index, hsGenes)
if(a != b) stop("Inconsistency")
# Multiple sources
length <- hsGenes$end - hsGenes$start
subtrack("1", 10e6, 15e6, index, hsGenes, length)
subtrack("1", 10e6, 15e6, index, hsGenes, length=length)
# Speed comparison (x200 here)
system.time(
  for(i in 1:40000) {
    subtrack("1", 10e6, 15e6, index, hsGenes)
  }
)
system.time(
  for(i in 1:200) {
    hsGenes[ hsGenes$chrom == "1" & hsGenes$start <= 15e6 & hsGenes$end >= 10e6 ,]
  }
)
# Convert chrom from factor to integer
hsGenes$chrom <- as.integer(as.character(hsGenes$chrom))
# Chromosome index (integer 'chrom')
index <- rep(NA_integer_, 24)
tmpIndex <- tapply(1:nrow(hsGenes), hsGenes$chrom, max)
index[ as.integer(names(tmpIndex)) ] <- tmpIndex
# Integer chrom query
print(class(hsGenes$chrom))
subtrack(1, 10e6, 15e6, index, hsGenes)