twobit_seqstats {Rtwobitlib}

R Documentation

Extract sequence lengths and letter counts from a .2bit file

Description

Extract the lengths and letter counts of the DNA sequences stored in a .2bit file.

Usage

twobit_seqstats(filepath)

twobit_seqlengths(filepath)

Arguments

filepath

A single string (character vector of length 1) containing a path to a .2bit file.

Details

twobit_seqlengths(filepath) is a shortcut for twobit_seqstats(filepath)[ , "seqlengths"]. It is also a much more efficient way to get the sequence lengths, because it does not need to load the sequence data into memory.

Value

For twobit_seqstats(): An integer matrix with one row per sequence in the .2bit file and 6 columns. The rownames of the matrix are the sequence names and the colnames are: seqlengths, A, C, G, T, N. Column seqlengths contains the length of each sequence, and columns A, C, G, T, and N contain the number of occurrences of the corresponding letter in each sequence.

For twobit_seqlengths(): A named integer vector where the names are the sequence names and the values are the corresponding lengths.
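
As an illustration of how a matrix of this shape can be used, here is a minimal sketch that does not call the package at all. It builds a small hand-made integer matrix with the documented row and column layout (the sequence names seqA and seqB and all counts are made-up toy values, not real output), then extracts the seqlengths column and derives a GC fraction per sequence. Excluding N from the denominator is a choice made for this example, not something the package prescribes.

```r
## Toy stand-in for the result of twobit_seqstats(): values are invented
## for illustration only; the layout follows the Value section.
stats <- matrix(
  c(10L, 3L, 2L, 2L, 3L, 0L,
     8L, 1L, 3L, 2L, 1L, 1L),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("seqA", "seqB"),
                  c("seqlengths", "A", "C", "G", "T", "N"))
)

## Selecting a single column of a matrix with rownames drops to a
## named integer vector, the same shape twobit_seqlengths() returns.
seqlengths <- stats[ , "seqlengths"]

## GC fraction per sequence, computed over the non-N letters only
## (an assumption of this sketch).
gc_fraction <- (stats[ , "C"] + stats[ , "G"]) /
               (stats[ , "seqlengths"] - stats[ , "N"])

seqlengths
gc_fraction
```

With a real file, stats would instead be obtained with twobit_seqstats(filepath), and the rest of the sketch would apply unchanged.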

References

A quick overview of the 2bit format: https://genome.ucsc.edu/FAQ/FAQformat.html#format7

Examples

filepath <- system.file(package="Rtwobitlib", "extdata", "sacCer2.2bit")

twobit_seqstats(filepath)

twobit_seqlengths(filepath)

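## Sanity checks:
sacCer2_seqstats <- twobit_seqstats(filepath)
stopifnot(
  ## The "seqlengths" column matches what twobit_seqlengths() returns.
  identical(sacCer2_seqstats[ , "seqlengths"], twobit_seqlengths(filepath)),
  ## The A, C, G, T, and N counts add up to the length of each sequence.
  all.equal(rowSums(sacCer2_seqstats[ , -1]),
            sacCer2_seqstats[ , "seqlengths"])
)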

[Package Rtwobitlib version 0.3.6 Index]