R: Read FASTA and FASTQ files.

read {insect}

R Documentation

Read FASTA and FASTQ files.

Description

Text parsing functions for reading sequences in the FASTA or FASTQ format into R.

Usage

readFASTQ(file = file.choose(), bin = TRUE)

readFASTA(
  file = file.choose(),
  bin = TRUE,
  residues = "DNA",
  alignment = FALSE
)

Arguments

`file`	the name of the FASTA or FASTQ file from which the sequences are to be read.
`bin`	logical indicating whether the returned object should be in binary/raw byte format (i.e. "DNAbin" or "AAbin" objects for nucleotide and amino acid sequences, respectively). If FALSE, a named vector of character strings is returned.
`residues`	character string indicating whether the sequences to be read are composed of nucleotides ("DNA"; default) or amino acids ("AA"). Only used by `readFASTA`, and only when `bin = TRUE`.
`alignment`	logical indicating whether the sequences represent an alignment to be parsed as a matrix. Only applies to `readFASTA`.

Details

Compatibility:

The FASTQ convention is somewhat ambiguous, with several slightly different interpretations appearing in the literature. For now, this function supports the Illumina convention for FASTQ files, where each sequence and its associated metadata occupy four lines of the text file as follows: (1) the run and cluster metadata preceded by an @ symbol; (2) the sequence itself in capitals without spaces; (3) a single "+" symbol; and (4) the Phred quality scores from 0 to 93 represented as ASCII symbols. For more information on this convention, see the Illumina help page listed under References.
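The four-line layout described above can be illustrated with a few lines of base R. This is only a sketch of the file format, not the parser used by readFASTQ; the file contents, read names and variable names are made up for the example.

```r
## A toy FASTQ file with two records (made-up data for illustration only)
fq <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1 hypothetical run and cluster metadata",
  "ACGTACGT",
  "+",
  "IIIIHHGG",
  "@read2 hypothetical run and cluster metadata",
  "TTGACCA",
  "+",
  "FFFFEED"
), fq)

## Read all lines and pick out each field by its position within the record
txt <- readLines(fq)
stopifnot(length(txt) %% 4 == 0)           # four lines per record
idx <- seq(1, length(txt), by = 4)
headers   <- sub("^@", "", txt[idx])       # (1) metadata without the leading @
sequences <- txt[idx + 1]                  # (2) the sequence itself
plus      <- txt[idx + 2]                  # (3) the "+" separator
qualities <- txt[idx + 3]                  # (4) ASCII-encoded Phred scores

## Each quality string must be as long as its sequence
stopifnot(all(plus == "+"), all(nchar(sequences) == nchar(qualities)))
names(sequences) <- headers
```

Files that wrap a sequence over several lines or repeat the header after the "+" do not follow this layout, and the simple position-based indexing above would not apply to them.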

Speed and Memory Requirements:

For optimal memory efficiency and compatibility with other functions, it is recommended to store sequences in raw byte format as either DNAbin or AAbin objects. For FASTQ files when bin = TRUE, a vector of quality scores (also in raw byte format) is attached to each sequence as an attribute. These can be converted back to numeric quality scores with as.integer. For FASTQ files when bin = FALSE, the function returns a vector with each sequence as a concatenated string, with a similarly concatenated quality attribute composed of the same ASCII characters used in the FASTQ coding scheme.

This function can take a while to process larger FASTQ files; a multithreading option may be available in a future version.
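The relationship between the two quality representations can be sketched in base R. The example assumes the common Phred+33 encoding, in which scores 0 to 93 map to ASCII codes 33 to 126. It also assumes that the raw bytes hold the scores themselves, as the as.integer conversion above implies. The quality string and variable names are made up for illustration.

```r
## Quality string as it would appear on line 4 of a FASTQ record (made up)
qual_ascii <- "II5+!~"

## Assumed Phred+33 encoding: subtract 33 from each character's ASCII code
phred <- utf8ToInt(qual_ascii) - 33L

## Storing the scores as raw bytes uses one byte per score ...
qual_raw <- as.raw(phred)

## ... and as.integer() recovers the numeric scores
stopifnot(identical(as.integer(qual_raw), phred))

## Adding the offset back gives the ASCII form used in the file
stopifnot(identical(intToUtf8(phred + 33L), qual_ascii))
```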

Value

Either a vector of character strings (if bin = FALSE), or a list of raw ("DNAbin" or "AAbin") vectors (if bin = TRUE), with each element having a "quality" attribute when the sequences are read from a FASTQ file.

Author(s)

Shaun Wilkinson

References

Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, Mills DA, Caporaso JG (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods, 10, 57-59.

Illumina help page: https://help.basespace.illumina.com/articles/descriptive/fastq-files/

Examples


  ## download and extract example FASTQ file to temporary directory
  td <- tempdir()
  URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1"
  dest <- file.path(td, "insect_tutorial1_files.zip")
  download.file(URL, destfile = dest, mode = "wb")
  unzip(dest, exdir = td)
  ## read the sequences and their quality scores in raw byte format
  x <- readFASTQ(file.path(td, "COI_sample2.fastq"))
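
A further sketch for readFASTA uses a small made-up FASTA file, so no download is needed. The sequences and object names are illustrative only, and the calls use just the arguments listed under Usage. The insect calls are skipped if the package is not installed.

```r
## Write a small made-up FASTA file to a temporary location
fa <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ACGTACGTAC", ">seq2", "ACGTTCGTAC"), fa)

## Only run the insect calls if the package is installed
if (requireNamespace("insect", quietly = TRUE)) {
  ## binary format (residues = "DNA" is the default)
  y <- insect::readFASTA(fa)
  ## character format: a named vector of character strings
  z <- insect::readFASTA(fa, bin = FALSE)
  ## equal-length sequences parsed as an alignment matrix
  aln <- insect::readFASTA(fa, alignment = TRUE)
}
```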

[Package insect version 1.4.2 Index]