primersearch {metacoder} | R Documentation |
Use EMBOSS primersearch for in silico PCR
Description
A pair of primers are aligned against a set of sequences. A
taxmap
object with two tables is returned: a table with
information for each predicted amplicon, quality of match, and predicted
amplicons, and a table with per-taxon amplification statistics. Requires the
EMBOSS tool kit (https://emboss.sourceforge.net/) to be installed.
Usage
primersearch(obj, seqs, forward, reverse, mismatch = 5, clone = TRUE)
Arguments
obj |
A |
seqs |
The sequences to do in silico PCR on. This can be any variable in
|
forward |
( |
reverse |
( |
mismatch |
An integer vector of length 1. The percentage of mismatches allowed. |
clone |
If |
Details
It can be confusing how the primer sequence relates to the binding sites on a reference database sequence. A simplified diagram can help. For example, if the top strand below (5' -> 3') is the database sequence, the forward primer has the same sequence as the target region, since it will bind to the other strand (3' -> 5') during PCR and extend on the 3' end. However, the reverse primer must bind to the database strand, so it will have to be the complement of the reference sequence. It also has to be reversed to make it in the standard 5' -> 3' orientation. Therefore, the reverse primer must be the reverse complement of its binding site on the reference sequence.
Primer 1: 5' AAGTACCTTAACGGAATTATAG 3' Primer 2: 5' GCTCCACCTACGAAACGAAT 3' <- TAAGCAAAGCATCCACCTCG 5' 5' ...AAGTACCTTAACGGAATTATAG......ATTCGTTTCGTAGGTGGAGC... 3' 3' ...TTCATGGAATTGCCTTAATATC......TAAGCAAAGCATCCACCTCG... 5' 5' AAGTACCTTAACGGAATTATAG ->
However, a database might have either the top or the bottom strand as a
reference sequence. Since one implies the sequence of the other, either is
valid, but this is another source of confusion. If we take the diagram above
and rotate it 180 degrees, it would mean the same thing, but which primer we would
want to call "forward" and which we would want to call "reverse" would
change. Databases of a single locus (e.g. Greengenes) will likely have a
convention for which strand will be present, so relative to this convention,
there is a distinct "forward" and "reverse". However, computers dont know
about this convention, so the "forward" primer is whichever primer has the
same sequence as its binding region in the database (as opposed to the
reverse complement). For this reason, primersearch will redefine which primer
is "forward" and which is "reverse" based on how it binds the reference
sequence. See the example code in primersearch_raw
for a
demonstration of this.
Value
A copy of the input taxmap
object with two tables added. One table contains amplicon information with one row per predicted amplicon with the following info:
(f_primer) 5' AAGTACCTTAACGGAATTATAG -> (r_primer) <- TAAGCAAAGCATCCACCTCG 5' 5' ...AAGTACCTTAACGGAATTATAG......ATTCGTTTCGTAGGTGGAGC... 3' ^ ^ ^ ^ f_start f_end r_rtart r_end |--------------------||----||------------------| f_match amplicon r_match |----------------------------------------------| product
- taxon_id:
The taxon IDs for the sequence.
- seq_index:
The index of the input sequence.
- f_primer:
The sequence of the forward primer.
- r_primer:
The sequence of the reverse primer.
- f_mismatch:
The number of mismatches on the forward primer.
- r_mismatch:
The number of mismatches on the reverse primer.
- f_start:
The start location of the forward primer.
- f_end:
The end location of the forward primer.
- r_start:
The start location of the reverse primer.
- r_end:
The end location of the reverse primer.
- f_match:
The sequence matched by the forward primer.
- r_match:
The sequence matched by the reverse primer.
- amplicon:
The sequence amplified by the primers, not including the primers.
- product:
The sequence amplified by the primers including the primers. This simulates a real PCR product.
The other table contains per-taxon information about the PCR, with one row per taxon. It has the following columns:
- taxon_ids:
Taxon IDs.
- query_count:
The number of sequences used as input.
- seq_count:
The number of sequences that had at least one amplicon.
- amp_count:
The number of amplicons. Might be more than one per sequence.
- amplified:
If at least one sequence of that taxon had at least one amplicon.
- multiple:
If at least one sequences had at least two amplicons.
- prop_amplified:
The proportion of sequences with at least one amplicon.
- med_amp_len:
The median amplicon length.
- min_amp_len:
The minimum amplicon length.
- max_amp_len:
The maximum amplicon length.
- med_prod_len:
The median product length.
- min_prod_len:
The minimum product length.
- max_prod_len:
The maximum product length.
Installing EMBOSS
The command-line tool "primersearch" from the EMBOSS tool kit is needed to use this function. How you install EMBOSS will depend on your operating system:
Linux:
Open up a terminal and type:
sudo apt-get install emboss
Mac OSX:
The easiest way to install EMBOSS on OSX is to use homebrew. After installing homebrew, open up a terminal and type:
brew install homebrew/science/emboss
Windows:
There is an installer for Windows here:
ftp://emboss.open-bio.org/pub/EMBOSS/windows/mEMBOSS-6.5.0.0-setup.exe
Examples
## Not run:
# Get example FASTA file
fasta_path <- system.file(file.path("extdata", "silva_subset.fa"),
package = "metacoder")
# Parse the FASTA file as a taxmap object
obj <- parse_silva_fasta(file = fasta_path)
# Simulate PCR with primersearch
# Have to replace Us with Ts in sequences since primersearch
# does not understand Us.
obj <- primersearch(obj,
gsub(silva_seq, pattern = "U", replace = "T"),
forward = c("U519F" = "CAGYMGCCRCGGKAAHACC"),
reverse = c("Arch806R" = "GGACTACNSGGGTMTCTAAT"),
mismatch = 10)
# Plot what did not ampilify
obj %>%
filter_taxa(prop_amplified < 1) %>%
heat_tree(node_label = taxon_names,
node_color = prop_amplified,
node_color_range = c("grey", "red", "purple", "green"),
node_color_trans = "linear",
node_color_axis_label = "Proportion amplified",
node_size = n_obs,
node_size_axis_label = "Number of sequences",
layout = "da",
initial_layout = "re")
## End(Not run)