local_taxa_tool {LocaTT} | R Documentation |
Perform Geographically-Conscious Taxonomic Assignment
Description
Performs taxonomic assignment of DNA metabarcoding sequences while considering geographic location.
Usage
local_taxa_tool(
path_to_sequences_to_classify,
path_to_BLAST_database,
path_to_output_file,
path_to_list_of_local_taxa = NA,
include_missing = FALSE,
blast_e_value = 1e-05,
blast_max_target_seqs = 2000,
blast_task = "megablast",
full_names = FALSE,
underscores = FALSE,
separator = ", ",
blastn_command = "blastn"
)
Arguments
path_to_sequences_to_classify |
String specifying path to FASTA file containing sequences to classify. File path cannot contain spaces. |
path_to_BLAST_database |
String specifying path to BLAST reference database in FASTA format. File path cannot contain spaces. |
path_to_output_file |
String specifying path to output file of classified sequences in CSV format. |
path_to_list_of_local_taxa |
String specifying path to list of local species in CSV format. The file should contain the following fields: 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'. There should be no 'NA's or blanks in the taxonomy fields. The species field should contain the binomial name without subspecies or other information below the species level. There should be no duplicate species (i.e., multiple records with the same species binomial and taxonomy) in the local species list. If local taxa suggestions are not desired, set this variable to NA (the default). |
include_missing |
Logical. If |
blast_e_value |
Numeric. Maximum E-value of returned BLAST hits (lower E-values are associated with more 'significant' matches). The default is 1e-05. |
blast_max_target_seqs |
Numeric. Maximum number of BLAST target sequences returned per query sequence. Enough target sequences should be returned to ensure that all minimum E-value matches are returned for each query sequence. A warning will be produced if this value is not sufficient. The default is 2000. |
blast_task |
String specifying BLAST task specification. Use |
full_names |
Logical. If |
underscores |
Logical. If |
separator |
String specifying the separator to use between taxa names in the output CSV file. The default is ", ". |
blastn_command |
String specifying path to the blastn program. The default ( |
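The requirements on the local taxa list described under path_to_list_of_local_taxa can be checked before running the tool. The following is a minimal sketch in base R and is not part of LocaTT: the helper name check_local_taxa_list is hypothetical, the example records are made up for illustration, and the two-word test for binomials is an assumed approximation of "no information below the species level".

```r
# Hypothetical helper (not part of LocaTT): checks a local taxa list
# against the requirements stated in the argument description.
# Returns a character vector of problems (empty if none were found).
check_local_taxa_list <- function(local_taxa) {
  taxonomy_fields <- c("Domain", "Phylum", "Class", "Order",
                       "Family", "Genus", "Species")
  required_fields <- c("Common_Name", taxonomy_fields)

  # All required fields must be present before other checks can run.
  missing_fields <- setdiff(required_fields, colnames(local_taxa))
  if (length(missing_fields) > 0) {
    return(paste("Missing fields:", paste(missing_fields, collapse = ", ")))
  }

  problems <- character(0)

  # No NAs or blanks in the taxonomy fields.
  has_blank <- vapply(taxonomy_fields, function(field) {
    values <- as.character(local_taxa[[field]])
    any(is.na(values) | trimws(values) == "")
  }, logical(1))
  if (any(has_blank)) {
    problems <- c(problems, paste("NA or blank values in:",
                                  paste(taxonomy_fields[has_blank], collapse = ", ")))
  }

  # Assumption: a binomial is exactly two whitespace-separated words.
  species <- as.character(local_taxa$Species)
  not_binomial <- lengths(strsplit(trimws(species), "\\s+")) != 2
  if (any(not_binomial)) {
    problems <- c(problems, paste("Not a two-word binomial:",
                                  paste(species[not_binomial], collapse = ", ")))
  }

  # No duplicate records (same binomial and taxonomy).
  is_duplicate <- duplicated(local_taxa[, taxonomy_fields, drop = FALSE])
  if (any(is_duplicate)) {
    problems <- c(problems, paste("Duplicate species records:",
                                  paste(unique(species[is_duplicate]), collapse = ", ")))
  }

  problems
}

# Made-up example list containing one duplicated record.
example_list <- data.frame(
  Common_Name = c("Gray wolf", "Red fox", "Red fox"),
  Domain = "Eukaryota", Phylum = "Chordata", Class = "Mammalia",
  Order = "Carnivora", Family = "Canidae",
  Genus = c("Canis", "Vulpes", "Vulpes"),
  Species = c("Canis lupus", "Vulpes vulpes", "Vulpes vulpes"),
  stringsAsFactors = FALSE
)
check_local_taxa_list(example_list)
```

A list read from disk with read.csv can be passed to the same helper before its path is supplied to local_taxa_tool.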
Details
Sequences are BLASTed against a global reference database, and the tool suggests locally occurring species which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Optionally, local sister taxonomic groups without reference sequences can be added to the local taxa suggestions by setting the include_missing argument to TRUE. If a local taxa list is not provided, then local taxa suggestions will be disabled, but all best-matching BLAST hits will still be returned. Alternatively, a reference database containing just the sequences of local species can be used, and local taxa suggestions can be disabled to return all best BLAST matches of local species. The reference database should be formatted with the format_reference_database function, and the local taxa lists can be prepared using the get_taxonomies.species_binomials and get_taxonomies.IUCN functions. Output field definitions are:
Sequence_name: The query sequence name.
Sequence: The query sequence.
Best_match_references: Species binomials of all best-matching BLAST hits (by bit score) from the reference database.
Best_match_E_value: The E-value associated with the best-matching BLAST hits.
Best_match_bit_score: The bit score associated with the best-matching BLAST hits.
Best_match_query_cover.mean: The mean query cover of all best-matching BLAST hits.
Best_match_query_cover.SD: The standard deviation of query cover of all best-matching BLAST hits.
Best_match_PID.mean: The mean percent identity of all best-matching BLAST hits.
Best_match_PID.SD: The standard deviation of percent identity of all best-matching BLAST hits.
Local_taxa (Field only present if a path to a local taxa list is provided): The finest taxonomic unit(s) which include both any species of the best-matching BLAST hits and any local species. If the species of any of the best-matching BLAST hits are local, then the finest taxonomic unit(s) are at the species level.
Local_species (Field only present if a path to a local taxa list is provided): Species binomials of all local species which belong to the taxonomic unit(s) in the Local_taxa field.
Local_taxa.include_missing (Field only present if both a path to a local taxa list is provided and the include_missing argument is set to TRUE): Local sister taxonomic groups without reference sequences are added to the local taxa suggestions from the Local_taxa field.
Local_species.include_missing (Field only present if both a path to a local taxa list is provided and the include_missing argument is set to TRUE): Species binomials of all local species which belong to the taxonomic unit(s) in the Local_taxa.include_missing field.
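The Local_taxa and Local_species definitions can be illustrated with a small sketch. This is not LocaTT's implementation: the function name finest_shared_unit and the toy taxonomies are hypothetical, and the sketch assumes that matching names at a single rank is sufficient (it ignores identical names under different higher taxa). It walks the ranks from finest to coarsest and returns the first rank at which a best-matching species and a local species share a taxonomic unit.

```r
# Taxonomic ranks ordered from coarsest to finest.
ranks <- c("Domain", "Phylum", "Class", "Order", "Family", "Genus", "Species")

# Hypothetical helper (not part of LocaTT): finds the finest rank at
# which any best-matching species and any local species share a unit.
finest_shared_unit <- function(hit_taxonomies, local_taxonomies) {
  for (rank in rev(ranks)) {
    shared <- intersect(hit_taxonomies[[rank]], local_taxonomies[[rank]])
    if (length(shared) > 0) {
      return(list(rank = rank, units = shared))
    }
  }
  # No shared unit at any rank.
  list(rank = NA_character_, units = character(0))
}

# Toy data: one best-matching species that is not on the local list.
best_hits <- data.frame(
  Domain = "Eukaryota", Phylum = "Chordata", Class = "Mammalia",
  Order = "Carnivora", Family = "Canidae",
  Genus = "Canis", Species = "Canis latrans",
  stringsAsFactors = FALSE
)
local_list <- data.frame(
  Domain = "Eukaryota", Phylum = "Chordata", Class = "Mammalia",
  Order = "Carnivora", Family = "Canidae",
  Genus = c("Canis", "Vulpes"),
  Species = c("Canis lupus", "Vulpes vulpes"),
  stringsAsFactors = FALSE
)

result <- finest_shared_unit(best_hits, local_list)

# Local species belonging to the shared unit(s), analogous to Local_species.
local_species <- local_list$Species[local_list[[result$rank]] %in% result$units]
```

In this toy case the best-matching species is not local, so the shared unit is found at the genus level (Canis), and the only local species in that unit is Canis lupus.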
Value
No return value. Writes an output CSV file with fields defined in the details section.
References
A manuscript describing this taxonomic assignment method is in preparation.
Examples
# Get path to example query sequences FASTA file.
path_to_query_sequences <- system.file("extdata",
                                       "example_query_sequences.fasta",
                                       package = "LocaTT",
                                       mustWork = TRUE)
# Get path to example reference database FASTA file.
path_to_reference_database <- system.file("extdata",
                                          "example_blast_database.fasta",
                                          package = "LocaTT",
                                          mustWork = TRUE)
# Get path to example local taxa list CSV file.
path_to_local_taxa_list <- system.file("extdata",
                                       "example_local_taxa_list.csv",
                                       package = "LocaTT",
                                       mustWork = TRUE)
# Create a temporary file path for the output CSV file.
path_to_output_CSV_file <- tempfile(fileext = ".csv")
# Run the local taxa tool.
local_taxa_tool(path_to_sequences_to_classify = path_to_query_sequences,
                path_to_BLAST_database = path_to_reference_database,
                path_to_output_file = path_to_output_CSV_file,
                path_to_list_of_local_taxa = path_to_local_taxa_list,
                include_missing = TRUE,
                full_names = TRUE,
                underscores = TRUE)