read {insect} | R Documentation |
Read FASTA and FASTQ files.
Description
Text parsing functions for reading sequences in the FASTA or FASTQ format into R.
Usage
readFASTQ(file = file.choose(), bin = TRUE)
readFASTA(
file = file.choose(),
bin = TRUE,
residues = "DNA",
alignment = FALSE
)
Arguments
file |
the name of the FASTA or FASTQ file from which the sequences are to be read. |
bin |
logical indicating whether the returned object should be in binary/raw byte format (i.e. "DNAbin" or "AAbin" objects for nucleotide and amino acid sequences, respectively). If FALSE a vector of named character strings is returned. |
residues |
character string indicating whether the sequences to
be read are composed of nucleotides ("DNA"; default) or amino acids ("AA").
Only required for |
alignment |
logical indicating whether the sequences represent
an alignment to be parsed as a matrix.
Only applies to |
Details
Compatibility:
The FASTQ convention is somewhat ambiguous, with several slightly different interpretations appearing in the literature. For now, this function supports the Illumina convention for FASTQ files, where each sequence and its associated metadata occupy four lines of the text file as follows: (1) the run and cluster metadata preceded by an @ symbol; (2) the sequence itself in capitals without spaces; (3) a single "+" symbol; and (4) the Phred quality scores from 0 to 93 represented as ASCII symbols. For more information on this convention see the Illumina help page listed under References.
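The four-line layout described above can be illustrated with a short base R sketch. This is not how readFASTQ is implemented; it only shows the record structure. The two reads, their names and the temporary file are made up for illustration.

```r
# Write a tiny two-record FASTQ file (hypothetical reads, illustration only)
tmp <- tempfile(fileext = ".fastq")
writeLines(c(
  "@read1", "ACGTAC", "+", "IIIIII",
  "@read2", "TTGCA",  "+", "!5?I~"
), tmp)

txt <- readLines(tmp)
stopifnot(length(txt) %% 4 == 0)   # each record occupies exactly four lines

idx     <- seq(1, length(txt), by = 4)
headers <- txt[idx]       # line 1: metadata, preceded by "@"
seqs    <- txt[idx + 1]   # line 2: the sequence
seps    <- txt[idx + 2]   # line 3: a single "+"
quals   <- txt[idx + 3]   # line 4: ASCII-encoded Phred scores

# Basic sanity checks on the assumed layout
stopifnot(
  all(startsWith(headers, "@")),
  all(seps == "+"),
  all(nchar(seqs) == nchar(quals))  # one quality symbol per residue
)

# Named character vector of sequences, with the leading "@" dropped
names(seqs) <- sub("^@", "", headers)
seqs
```

Files that wrap sequences over several lines or repeat the header after the "+" would fail these checks, which is consistent with the function supporting only the four-line convention.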
Speed and Memory Requirements:
For optimal memory efficiency and compatibility with other functions,
it is recommended to store sequences in raw byte format
as either DNAbin or AAbin objects.
For FASTQ files when bin = TRUE, a vector of quality scores
(also in raw-byte format) is attributed to each sequence.
These can be converted back to numeric quality scores with as.integer.
For FASTQ files when bin = FALSE, the function returns a vector with each
sequence as a concatenated string, with a similarly concatenated quality attribute
composed of the same ASCII characters used in the FASTQ encoding scheme.
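The relationship between the two quality representations can be sketched in base R. The sketch assumes the Phred+33 offset, which matches the 0 to 93 range stated above (ASCII codes 33 to 126). It also assumes that the raw bytes hold the Phred scores directly. The quality string is made up, and nothing here calls readFASTQ itself.

```r
# ASCII quality string as it would appear on line 4 of a record (hypothetical)
qual_ascii <- "!5?I~"

# Phred+33 decoding: subtract 33 from each character's ASCII code
phred <- utf8ToInt(qual_ascii) - 33L
phred
# 0 20 30 40 93

# Raw-byte storage (assumed here to be one byte per Phred score) ...
qual_raw <- as.raw(phred)

# ... converts back to integer scores with as.integer
as.integer(qual_raw)

# Estimated per-base error probability, P = 10^(-Q/10)
p_err <- 10^(-phred / 10)

# Re-encoding the scores recovers the original ASCII string
intToUtf8(as.integer(qual_raw) + 33L)
```

A raw vector uses one byte per score, whereas an integer vector uses four, which is in line with the memory recommendation above.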
This function can take a while to process larger FASTQ files; a multithreading option may be available in a future version.
Value
Either a vector of character strings (if bin = FALSE), or a list of raw ("DNAbin" or "AAbin") vectors, with each element having a "quality" attribute.
Author(s)
Shaun Wilkinson
References
Bokulich NA, Subramanian S, Faith JJ, Gevers D, Gordon JI, Knight R, Mills DA, Caporaso JG (2013) Quality-filtering vastly improves diversity estimates from Illumina amplicon sequencing. Nat Methods, 10, 57-59.
Illumina help page: https://help.basespace.illumina.com/articles/descriptive/fastq-files/
See Also
writeFASTQ and writeFASTA for writing sequences to text in the FASTA or FASTQ format.
See also read.dna in the ape package.
Examples
## download and extract example FASTQ file to temporary directory
td <- tempdir()
URL <- "https://www.dropbox.com/s/71ixehy8e51etdd/insect_tutorial1_files.zip?dl=1"
dest <- paste0(td, "/insect_tutorial1_files.zip")
download.file(URL, destfile = dest, mode = "wb")
unzip(dest, exdir = td)
x <- readFASTQ(paste0(td, "/COI_sample2.fastq"))