readFastqDb {alakazam}    R Documentation

Load sequencing quality scores from a FASTQ file

Description

readFastqDb reads sequencing quality scores from a FASTQ file and adds them to a data.frame. Records in the data.frame are matched to FASTQ entries by 'sequence_id'.

Usage

readFastqDb(
  data,
  fastq_file,
  quality_offset = -33,
  header = c("presto", "asis"),
  sequence_id = "sequence_id",
  sequence = "sequence",
  sequence_alignment = "sequence_alignment",
  v_cigar = "v_cigar",
  d_cigar = "d_cigar",
  j_cigar = "j_cigar",
  np1_length = "np1_length",
  np2_length = "np2_length",
  v_sequence_end = "v_sequence_end",
  d_sequence_end = "d_sequence_end",
  style = c("num", "ascii", "both"),
  quality_sequence = FALSE
)

Arguments

data

data.frame containing sequence data.

fastq_file

path to the FASTQ file.

quality_offset

offset value to be used by ape::read.fastq. It is the value to be added to the quality scores (the default -33 applies to the Sanger format and should work for most recent FASTQ files).
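To illustrate what the offset means, here is a minimal base R sketch. It is not the code used by ape::read.fastq or readFastqDb. It assumes the Sanger (Phred+33) encoding, and the string qual_ascii is made up for the example.

```r
# Hypothetical quality string in Sanger (Phred+33) encoding
qual_ascii <- "II5+!"

# utf8ToInt() returns the ASCII code of each character;
# adding the offset of -33 yields the Phred scores
quality_offset <- -33
phred <- utf8ToInt(qual_ascii) + quality_offset
phred

# Reverse direction: numeric scores back to the ASCII string
qual_roundtrip <- intToUtf8(phred - quality_offset)
qual_roundtrip
```

A file in an encoding with a different ASCII base would need a different offset.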

header

FASTQ file header format; one of "presto" or "asis". Use "presto" to specify that the FASTQ file headers use the pRESTO format and can be parsed to extract the sequence_id. Use "asis" to skip any processing and use the sequence names as they are.
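As a rough illustration of the "presto" case, the sketch below assumes pRESTO-style headers of the form ID|FIELD=VALUE|..., where the identifier is the text before the first "|". The header strings are hypothetical, and the parsing readFastqDb actually performs may differ.

```r
# Hypothetical FASTQ read names with pRESTO-style annotations
headers <- c("SEQ1|PRIMER=IGHJ|CONSCOUNT=12", "SEQ2|PRIMER=IGHM")

# Keep everything before the first "|" as the sequence identifier
ids <- sub("\\|.*$", "", headers)
ids
```

With header = "asis", the full read name would instead need to match the values in the sequence_id column exactly.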

sequence_id

column in data that contains sequence identifiers to be matched to sequence identifiers in fastq_file.

sequence

column in data that contains sequence data.

sequence_alignment

column in data that contains IMGT aligned sequence data.

v_cigar

column in data that contains CIGAR strings for the V gene alignments.

d_cigar

column in data that contains CIGAR strings for the D gene alignments.

j_cigar

column in data that contains CIGAR strings for the J gene alignments.

np1_length

column in data that contains the number of nucleotides between the V gene and first D gene alignments or between the V gene and J gene alignments.

np2_length

column in data that contains the number of nucleotides between either the first D gene and J gene alignments or the first D gene and second D gene alignments.

v_sequence_end

column in data that contains the end position of the V gene in sequence.

d_sequence_end

column in data that contains the end position of the D gene in sequence.

style

how the sequencing quality should be returned; one of "num", "ascii", or "both". Specify "num" to store the quality scores as strings of comma separated numeric values. Use "ascii" to have the function return the scores as Phred (ASCII) scores. Use "both" to retrieve both.

quality_sequence

specify TRUE to keep the quality scores for sequence. If FALSE, only the quality scores for sequence_alignment will be added to data.

Value

Modified data with additional fields:

  1. quality_alignment: A character vector with ASCII Phred scores for sequence_alignment.

  2. quality_alignment_num: A character vector, with comma separated numerical quality values for each position in sequence_alignment.

  3. quality: A character vector with ASCII Phred scores for sequence.

  4. quality_num: A character vector, with comma separated numerical quality values for each position in sequence.
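Because the numeric fields are stored as comma separated strings, they need to be split before use in calculations. The base R sketch below uses a made-up value and assumes every position carries a numeric score; gap positions in an aligned sequence may need extra handling. The threshold of 20 is chosen only for illustration.

```r
# Hypothetical value of quality_alignment_num for one sequence
quality_alignment_num <- "40,40,20,10,0"

# Split on commas and convert to a numeric vector
scores <- as.numeric(strsplit(quality_alignment_num, ",", fixed = TRUE)[[1]])

# Positions with a Phred score below 20
low_quality <- which(scores < 20)
low_quality
```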

See Also

maskPositionsByQuality and getPositionQuality

Examples

db <- airr::read_rearrangement(system.file("extdata", "example_quality.tsv", package="alakazam"))
fastq_file <- system.file("extdata", "example_quality.fastq", package="alakazam")
db <- readFastqDb(db, fastq_file, quality_offset=-33)


[Package alakazam version 1.1.0 Index]