local_taxa_tool {LocaTT}    R Documentation

Perform Geographically-Conscious Taxonomic Assignment

Description

Performs taxonomic assignment of DNA metabarcoding sequences while considering geographic location.

Usage

local_taxa_tool(
  path_to_sequences_to_classify,
  path_to_BLAST_database,
  path_to_output_file,
  path_to_list_of_local_taxa = NA,
  include_missing = FALSE,
  blast_e_value = 1e-05,
  blast_max_target_seqs = 2000,
  blast_task = "megablast",
  full_names = FALSE,
  underscores = FALSE,
  separator = ", ",
  blastn_command = "blastn"
)

Arguments

path_to_sequences_to_classify

String specifying path to FASTA file containing sequences to classify. File path cannot contain spaces.

path_to_BLAST_database

String specifying path to BLAST reference database in FASTA format. File path cannot contain spaces.
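Since neither the query FASTA path nor the reference database path may contain spaces, it can help to check both before calling the function. The following is a minimal sketch; check_no_spaces is a hypothetical helper and is not part of LocaTT, and the paths are placeholders.

```r
# Hypothetical helper (not part of LocaTT): stop early if a path contains a space.
check_no_spaces <- function(path) {
  if (grepl(" ", path, fixed = TRUE)) {
    stop("File path contains spaces: ", path)
  }
  invisible(path)
}

# A path without spaces passes silently.
check_no_spaces("data/query_sequences.fasta")

# A path with a space raises an error, captured here with try().
result <- try(check_no_spaces("my data/query_sequences.fasta"), silent = TRUE)
has_error <- inherits(result, "try-error")
```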

path_to_output_file

String specifying path to output file of classified sequences in CSV format.

path_to_list_of_local_taxa

String specifying path to list of local species in CSV format. The file should contain the following fields: 'Common_Name', 'Domain', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species'. There should be no 'NA's or blanks in the taxonomy fields. The species field should contain the binomial name without subspecies or other information below the species level. There should be no duplicate species (i.e., multiple records with the same species binomial and taxonomy) in the local species list. If local taxa suggestions are not desired, set this variable to NA (the default).
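As an illustration of the expected layout, the sketch below builds a two-species local taxa list in base R, checks the constraints described above, and writes it to a temporary CSV file. The rows are illustrative only, the two-word binomial check is a simple heuristic, and it is an assumption that a file written by write.csv is read correctly by the function.

```r
# Illustrative rows only; replace with the species that occur in the study area.
local_taxa <- data.frame(
  Common_Name = c("Gray wolf", "Red fox"),
  Domain = c("Eukaryota", "Eukaryota"),
  Phylum = c("Chordata", "Chordata"),
  Class = c("Mammalia", "Mammalia"),
  Order = c("Carnivora", "Carnivora"),
  Family = c("Canidae", "Canidae"),
  Genus = c("Canis", "Vulpes"),
  Species = c("Canis lupus", "Vulpes vulpes"),
  stringsAsFactors = FALSE
)

taxonomy_fields <- c("Domain", "Phylum", "Class", "Order",
                     "Family", "Genus", "Species")
taxonomy <- local_taxa[, taxonomy_fields]

# No NAs or blanks in the taxonomy fields.
stopifnot(!any(is.na(taxonomy)))
stopifnot(all(trimws(as.matrix(taxonomy)) != ""))

# Heuristic: a binomial has exactly two words (nothing below species level).
stopifnot(all(lengths(strsplit(local_taxa$Species, " ", fixed = TRUE)) == 2))

# No duplicate species records.
stopifnot(!any(duplicated(taxonomy)))

# Write the list to a CSV file that can be passed as path_to_list_of_local_taxa.
path_to_local_taxa_list <- tempfile(fileext = ".csv")
write.csv(local_taxa, path_to_local_taxa_list, row.names = FALSE)
```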

include_missing

Logical. If TRUE, then additional fields are included in the output CSV file in which local sister taxonomic groups without reference sequences are added to the local taxa suggestions. If FALSE (the default), then this is not performed.

blast_e_value

Numeric. Maximum E-value of returned BLAST hits (lower E-values are associated with more 'significant' matches). The default is 1e-05.

blast_max_target_seqs

Numeric. Maximum number of BLAST target sequences returned per query sequence. Enough target sequences should be returned to ensure that all minimum E-value matches are returned for each query sequence. A warning will be produced if this value is not sufficient. The default is 2000.

blast_task

String specifying the BLAST task. Use 'megablast' (the default) to find very similar sequences (e.g., intraspecies or closely related species). Use 'blastn-short' for sequences shorter than 50 bases. See the blastn program help documentation for additional options and details.
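One way to choose between these two tasks is to look at the lengths of the query sequences. The sketch below is an assumption about workflow, not part of LocaTT: fasta_sequence_lengths is a hypothetical helper, the demo FASTA file is made up, and the rule of switching to 'blastn-short' only when every sequence is shorter than 50 bases is one possible choice.

```r
# Hypothetical helper (not part of LocaTT): length of each sequence in a FASTA file.
fasta_sequence_lengths <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines)]          # drop empty lines
  is_header <- startsWith(lines, ">")    # header lines begin with ">"
  record <- cumsum(is_header)            # record index for every line
  # Sum the characters of the sequence lines within each record.
  as.vector(tapply(nchar(lines[!is_header]), record[!is_header], sum))
}

# Made-up demo file: one sequence wrapped over two lines, one on a single line.
demo_fasta <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ACGTACGTACGT", "ACGT", ">seq2", "ACGTACGTAC"), demo_fasta)

lengths_bp <- fasta_sequence_lengths(demo_fasta)

# Assumed rule: use 'blastn-short' only if all sequences are shorter than 50 bases.
blast_task <- if (all(lengths_bp < 50)) "blastn-short" else "megablast"
```

The resulting string can then be passed as the blast_task argument.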

full_names

Logical. If TRUE, then full taxonomies are returned in the output CSV file. If FALSE (the default), then only the lowest taxonomic levels (e.g., species binomials instead of the full species taxonomies) are returned in the output CSV file.

underscores

Logical. If TRUE, then taxa names in the output CSV file use underscores instead of spaces. If FALSE (the default), then taxa names in the output CSV file use spaces.

separator

String specifying the separator to use between taxa names in the output CSV file. The default is ', '.
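The combined effect of the underscores and separator arguments on a list of taxa names can be pictured with the sketch below. format_taxa is a hypothetical function written only to mimic the formatting described above; it is not the package's own code, and the taxa names are illustrative.

```r
# Hypothetical illustration of the described formatting (not LocaTT's code).
format_taxa <- function(taxa, underscores = FALSE, separator = ", ") {
  if (underscores) {
    taxa <- gsub(" ", "_", taxa, fixed = TRUE)  # spaces become underscores
  }
  paste(taxa, collapse = separator)             # join names with the separator
}

taxa <- c("Canis lupus", "Vulpes vulpes")

default_style <- format_taxa(taxa)
# "Canis lupus, Vulpes vulpes"

custom_style <- format_taxa(taxa, underscores = TRUE, separator = "; ")
# "Canis_lupus; Vulpes_vulpes"
```

A separator other than a comma may be convenient if the output CSV will be split on commas by other software.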

blastn_command

String specifying path to the blastn program. The default ('blastn') should work for standard BLAST installations. The user can provide a path to the blastn program for non-standard BLAST installations.
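To see whether the default will work, base R's Sys.which can be used to check if blastn is on the search path. This is a minimal sketch: the fallback path is a placeholder that must be replaced with the real location of blastn on the system.

```r
# Look for blastn on the search path; returns "" if it is not found.
blastn_path <- Sys.which("blastn")

if (nzchar(blastn_path)) {
  # Standard installation: the default command should work.
  blastn_command <- "blastn"
} else {
  # Placeholder path (assumption); point this at the actual blastn executable.
  blastn_command <- "/path/to/ncbi-blast/bin/blastn"
}

# Optionally print the BLAST version when blastn was found.
if (nzchar(blastn_path)) {
  system2(blastn_command, "-version")
}
```

The resulting string can then be passed as the blastn_command argument.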

Details

Sequences are BLASTed against a global reference database, and the tool suggests locally occurring species which are most closely related (by taxonomy) to any of the best-matching BLAST hits (by bit score). Optionally, local sister taxonomic groups without reference sequences can be added to the local taxa suggestions by setting the include_missing argument to TRUE. If a local taxa list is not provided, then local taxa suggestions will be disabled, but all best-matching BLAST hits will still be returned. Alternatively, a reference database containing just the sequences of local species can be used, and local taxa suggestions can be disabled to return all best BLAST matches of local species. The reference database should be formatted with the format_reference_database function, and the local taxa lists can be prepared using the get_taxonomies.species_binomials and get_taxonomies.IUCN functions. Output field definitions are:

Value

No return value. Writes an output CSV file with fields defined in the Details section.

References

A manuscript describing this taxonomic assignment method is in preparation.

Examples


# Get path to example query sequences FASTA file.
path_to_query_sequences<-system.file("extdata",
                                     "example_query_sequences.fasta",
                                     package="LocaTT",
                                     mustWork=TRUE)

# Get path to example reference database FASTA file.
path_to_reference_database<-system.file("extdata",
                                        "example_blast_database.fasta",
                                        package="LocaTT",
                                        mustWork=TRUE)

# Get path to example local taxa list CSV file.
path_to_local_taxa_list<-system.file("extdata",
                                     "example_local_taxa_list.csv",
                                     package="LocaTT",
                                     mustWork=TRUE)

# Create a temporary file path for the output CSV file.
path_to_output_CSV_file<-tempfile(fileext=".csv")

# Run the local taxa tool.
local_taxa_tool(path_to_sequences_to_classify=path_to_query_sequences,
                path_to_BLAST_database=path_to_reference_database,
                path_to_output_file=path_to_output_CSV_file,
                path_to_list_of_local_taxa=path_to_local_taxa_list,
                include_missing=TRUE,
                full_names=TRUE,
                underscores=TRUE)


[Package LocaTT version 1.1.1 Index]