barcode_clean {MACER}    R Documentation

DNA Barcode Clean

Description

Takes an input fasta file and identifies genus-level and species-level outliers, defined as records whose genetic distance is more than 1.5 times the interquartile range. If selected, it also checks the sequences using amino acid translation, and it has the option to eliminate sequences that have non-IUPAC codes. Finally, the program calculates the barcode gap for the species in the submitted dataset.
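As a rough illustration of the barcode gap mentioned above, the sketch below takes a pairwise distance matrix and one species label per record, then reports the largest intraspecific distance and the smallest interspecific distance for a single species. The gap is taken here as the difference between the two, which is a common definition; the helper name barcode_gap_sketch and the toy data are hypothetical, and MACER's own calculation may differ.

```r
# Hypothetical helper (not part of MACER): barcode gap for one target species.
barcode_gap_sketch <- function(d, species, target) {
  d <- as.matrix(d)
  is_target <- species == target
  # Distances among records of the target species (upper triangle only).
  intra_block <- d[is_target, is_target, drop = FALSE]
  intra <- intra_block[upper.tri(intra_block)]
  # Distances from target records to records of every other species.
  inter <- d[is_target, !is_target, drop = FALSE]
  max_intra <- if (length(intra) > 0) max(intra) else NA_real_
  min_inter <- if (length(inter) > 0) min(inter) else NA_real_
  c(max_intra = max_intra, min_inter = min_inter, gap = min_inter - max_intra)
}

# Toy symmetric distance matrix for four records in two species.
d <- matrix(c(0.00, 0.01, 0.08, 0.09,
              0.01, 0.00, 0.07, 0.10,
              0.08, 0.07, 0.00, 0.02,
              0.09, 0.10, 0.02, 0.00), nrow = 4, byrow = TRUE)
species <- c("sp_a", "sp_a", "sp_b", "sp_b")
gap_a <- barcode_gap_sketch(d, species, "sp_a")
print(gap_a)
```

A positive gap means every record of the species is closer to its conspecifics than to any record of another species; a gap at or below zero suggests overlap.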

Usage

barcode_clean(
  AA_code = "invert",
  AGCT_only = TRUE,
  data_folder = NULL,
  outliers = TRUE,
  dist_model = "raw",
  replicates = 1000,
  replacement = TRUE,
  conf_level = 1,
  numCores = 1
)

Arguments

AA_code

This is the amino acid translation matrix (as implemented through ape) used to check the sequences for stop codons. The following codes are available: std, vert, invert, F. The default is invert.
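To show why the choice of translation code matters, here is a minimal base R sketch of a stop codon check. It is not MACER's implementation (which relies on ape); the helper has_stop_codon is hypothetical, it only inspects reading frame 1, and it drops alignment gaps before splitting into codons, both of which are assumptions. The stop codon sets are those of the standard, vertebrate mitochondrial, and invertebrate mitochondrial genetic codes.

```r
# Stop codons for the standard, vertebrate mitochondrial, and
# invertebrate mitochondrial genetic codes.
stop_codons <- list(
  std    = c("TAA", "TAG", "TGA"),
  vert   = c("TAA", "TAG", "AGA", "AGG"),
  invert = c("TAA", "TAG")
)

# Hypothetical helper: TRUE if a stop codon occurs in reading frame 1.
has_stop_codon <- function(dna, code = "invert") {
  dna <- toupper(gsub("-", "", dna))     # drop alignment gaps (assumption)
  n <- nchar(dna) - nchar(dna) %% 3      # ignore a trailing partial codon
  if (n < 3) return(FALSE)
  starts <- seq(1, n, by = 3)
  codons <- substring(dna, starts, starts + 2)
  any(codons %in% stop_codons[[code]])
}

# TGA is a stop codon in the standard code but not in the
# invertebrate mitochondrial code.
stop_std    <- has_stop_codon("ATGTGAGGA", code = "std")
stop_invert <- has_stop_codon("ATGTGAGGA", code = "invert")
```

The same sequence can therefore be flagged or accepted depending on AA_code, so the code should match the organism and marker being cleaned.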

AGCT_only

This indicates whether records with characters other than AGCT are removed. TRUE removes records with non-AGCT characters; FALSE accepts all IUPAC characters. The default is TRUE.
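The filter described above can be sketched in base R as a simple pattern match. The helper is_agct_only and the example records are hypothetical; tolerating the gap character "-" is an assumption made here because the inputs are alignments.

```r
# Hypothetical helper: TRUE for sequences made only of A, G, C, T
# (alignment gaps "-" are tolerated here, which is an assumption).
is_agct_only <- function(seqs) {
  grepl("^[AGCT-]+$", toupper(seqs))
}

seqs <- c(rec1 = "ACGT-ACGT", rec2 = "ACGNNACGT", rec3 = "acgtRacgt")
kept <- seqs[is_agct_only(seqs)]   # rec2 (N) and rec3 (R) are dropped
print(names(kept))
```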

data_folder

This variable can be used to provide the location of the MSA fasta files to be cleaned. The default value is NULL, in which case the program prompts the user to select the folder through point-and-click.

outliers

This is the variable to indicate if the user would like to remove suspected sequence record outliers, identified from the genetic distances using the 1.5 x interquartile range criterion. If set to TRUE, genus-level and species-level outliers will be removed. If FALSE, this will not occur. Default is TRUE.
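One way to read the 1.5 x interquartile range criterion is the usual upper fence, where a value is flagged if it exceeds the third quartile plus 1.5 times the interquartile range. The sketch below applies that rule to a vector of per-record mean distances. The helper flag_outliers_iqr, the use of mean distances, and the one-sided fence are assumptions for illustration and may not match MACER's exact procedure.

```r
# Hypothetical helper: flag values above Q3 + k * IQR (upper fence only).
flag_outliers_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  upper <- q[2] + k * (q[2] - q[1])
  x > upper
}

# Toy per-record mean genetic distances; record "e" is far from the rest.
mean_dist <- c(a = 0.010, b = 0.012, c = 0.011, d = 0.013, e = 0.090)
flags <- flag_outliers_iqr(mean_dist)
print(names(mean_dist)[flags])
```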

dist_model

This is the model of nucleotide evolution that the ape program will use (see the ape documentation for options). Default is "raw".
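For orientation, the "raw" model in ape is the proportion of sites that differ between each pair of sequences. A minimal base R sketch of that quantity for two aligned sequences follows; the helper p_distance is hypothetical, and skipping columns that contain a gap is an assumption about gap handling rather than a statement of what ape or MACER does.

```r
# Hypothetical helper: proportion of differing sites between two
# aligned sequences of equal length.
p_distance <- function(a, b) {
  a <- strsplit(toupper(a), "")[[1]]
  b <- strsplit(toupper(b), "")[[1]]
  stopifnot(length(a) == length(b))
  keep <- a != "-" & b != "-"        # skip gapped columns (assumption)
  mean(a[keep] != b[keep])
}

# One mismatch over eight compared sites.
d_raw <- p_distance("ACGTACGT", "ACGTACGA")
print(d_raw)
```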

replicates

This is the number of replicates that the bootstrapping will perform. Note: more replicates will take longer. Default is 1000.

replacement

This indicates whether MSA nucleotide columns are sampled with replacement during the random resampling. Default is TRUE.

conf_level

This is the proportion of the initial MSA nucleotide length used for each bootstrapped resample. When set to 1, the bootstrapped resampling will have the same length as the initial MSA. Default is 1.
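The interaction of replacement and conf_level can be sketched as a single resampling step over alignment columns. The helper resample_columns and the toy alignment are hypothetical, and rounding the number of drawn columns is an assumption; in a bootstrap, a step like this would be repeated once per replicate.

```r
# Hypothetical helper: draw alignment columns at random.
# msa is a character matrix with one row per record and one column per site.
resample_columns <- function(msa, replacement = TRUE, conf_level = 1) {
  n_cols <- ncol(msa)
  n_draw <- round(conf_level * n_cols)   # resample length (rounding is an assumption)
  idx <- sample(n_cols, size = n_draw, replace = replacement)
  msa[, idx, drop = FALSE]
}

set.seed(1)  # fixed seed so the toy example is repeatable
msa <- rbind(rec1 = strsplit("ACGTACGTAC", "")[[1]],
             rec2 = strsplit("ACGTTCGTAA", "")[[1]])

# Same length as the input, columns may repeat.
boot <- resample_columns(msa, replacement = TRUE, conf_level = 1)
# Half the length, each column used at most once.
half <- resample_columns(msa, replacement = FALSE, conf_level = 0.5)
```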

numCores

This is the number of cores that the user would like to use where multithreading is available. Default is set to 1, indicating only a single thread will be used.

Details

Input: A folder containing one or more fasta files of interest.

Value

Output: A single log file for the run of the function, named A_Clean_File_YYYY-DD-TTTTTTTT. The function also outputs three files for each fasta file submitted. The first is the distance matrix that was calculated and used to assess the DNA barcode gaps. This file is named the same as the input file with dist_table.dat appended to the end of the name. The second is the total data table file, which provides a table of all submitted records for each data set along with the results from each section of the analysis. This file is named the same as the input fasta with data_table.dat appended to the end. Finally, a fasta file with all outliers and flagged records removed is generated for each input fasta file. This output file is named the same as the input fasta with no_outlier.fas appended to the end.

Author(s)

Robert G. Young

References

<https://github.com/rgyoung6/MACER>

Young RG, Gill R, Gillis D, Hanner RH (2021) Molecular Acquisition, Cleaning and Evaluation in R (MACER) - A tool to assemble molecular marker datasets from BOLD and GenBank. Biodiversity Data Journal 9: e71378. <https://doi.org/10.3897/BDJ.9.e71378>

See Also

auto_seq_download() create_fastas() align_to_ref()

Examples

## Not run: 
barcode_clean()
barcode_clean(AA_code = "vert", AGCT_only = TRUE)
barcode_clean(AA_code = "vert")

## End(Not run)


[Package MACER version 0.2.1 Index]