R: Breaks up genbank sequences into their annotated components...

AnnotationBust {AnnotationBustR}

R Documentation

Breaks up genbank sequences into their annotated components based on a set of search terms and writes each subsequence of interest to a FASTA for each accession number supplied.

Description

Breaks up genbank sequences into their annotated components based on a set of search terms and writes each subsequence of interest to a FASTA for each accession number supplied.

Usage

AnnotationBust(
  Accessions,
  Terms,
  Duplicates = NULL,
  DuplicateInstances = NULL,
  TranslateSeqs = FALSE,
  TranslateCode = 1,
  DuplicateSpecies = FALSE,
  Prefix = NULL,
  TidyAccessions = TRUE
)

Arguments

`Accessions`	A vector of INSDC GenBank accession numbers. Note that refseq numbers are not supported at the moment (i.e. prefixes like XM_ and NC_) will not work.
`Terms`	A data frame of search terms. Search terms for animal mitogenomes, nuclear rRNA, chloroplast genomes, and plant mitogenomes are pre-made and can be loaded using the data()function. Additional terms can be addded using the MergeSearchTerms function.
`Duplicates`	A vector of loci names that have more than one copy. Default is NULL
`DuplicateInstances`	A numeric vector the length of Duplicates of the number of duplicates for each duplicated gene you wish to extract.Default is NULL
`TranslateSeqs`	Logical as to whether or not the sequence should be translated to the corresponding peptide sequence. Default is FALSE. Note that this only makes sense to list as TRUE for protein coding sequences.
`TranslateCode`	Numerical representing the GenBank translation code for which sequences should be translated under. Default is 1. For all options see ?seqinr::getTrans. Some possible useful ones are: 2. Vertebrate Mitochondrial, 5. Invertebrate Mitochondrial, and 11. bacterial+plantplastid
`DuplicateSpecies`	Logical. As to whether there are duplicate individuals per species. If TRUE, adds the accession number to the fasta file
`Prefix`	Character If a prefix is specified, all output FASTA files written will begin with the prefix. Default is NULL.
`TidyAccessions`	Logical as to whether the accession table should have a single row per species. If numerous accessions for a species occur, they will be seperated by a comma in the accession table. Default=TRUE.

Details

The AnnotationBust function takes a vector of accession numbers and a data frame of search terms and extracts subsequences from genomes or concatenated sequences. This function requires a steady internet connection. It writes files in the FASTA format to the working directory and returns an accession table. Files append, so use different prefixes between runs, otherwise they will be added to the current files in the working directory if the names are the same. AnnoitationBustR comes with pre-made search terms for metazoan mitogenomes, plant mitogenomes, chloroplast genomes, and rDNA that can be loaded using data(mtDNAterms), data(mtDNAtermsPlants),data(cpDNAterms), and data(rDNAterms) respectively. Search terms can be completely made by the user as long as they follow a similar format with three columns. The first, Locus, should contain the name of the locus that will also be used to name the files. We recommend following a similar naming convention to what we currently have in the pre-made data frames to ensure that files are named properly, characters like "-" or ".", and names starting with numbers should be avoided as it can cause errors with R. The second column, Type, contains the type of subsequence it is, with options being CDS, rRNA, tRNA, misc_RNA, Intro, Exon, and D_Loop. The last column, Name, consists of a possible name for the locus of interest as it might appear in an annotation. For numerous synonyms for the same locus, one should have each synonym as its own row.An additional fourth column is needed for extracting introns/exons. This column, called IntronExonNumber should contain the number of the intron to extract. If extracting both introns/exons and non-intron/exon sequences the fourth column should be NA for non-intron/exon sequence types. See the examples below and the vignette for detailed examples on extracting intron and exons. It is possible that some subsequences are not fully annotated on ACNUC and, therefore, are not extractable. These will return in the accession table as "type not fully Ann". It is also possible that the sequence has no annotations at all, for which it will return "No Ann. For". If the function returns "Acc. Not Found", the accession number supplied could not be found on NCBI. If "Not On ACNUC GenBank" is returned, the accession is not available through AcNUC. This may be due to ACNUC not being fully up to date. To see the last time ACNUC was updated, run seqinr::choosebank("genbank", infobank=T).

For a more detailed walkthrough on using AnnotationBust you can access the vignette with vignette("AnnotationBustR).

Value

Writes a fasta file(s) to the current working directory selected for each unique subsequence of interest in Terms containing all the accession numbers the subsequence was found in.

An accesion table of class data.frame.

References

Borstein, Samuel R., and Brian C. O'Meara. "AnnotationBustR: An R package to extract subsequences from GenBank annotations." PeerJ Preprints 5 (2017): e2920v1.

Examples

## Not run: 
#Create vector of three NCBI accessions of rDNA toget subsequences of and load rDNA terms.
ncbi.accessions<-c("FJ706295","FJ706343","FJ706292")
data(rDNAterms)#load rDNA search terms from AnnotationBustR
#Run AnnotationBustR and write files to working directory
my.sequences<-AnnotationBust(ncbi.accessions, rDNAterms, DuplicateSpecies=TRUE,
Prefix="Example1")
my.sequences#Return the accession table for each species.

###Example With matK CDS and trnK Introns/Exons##
#Subset out matK from cpDNAterms
cds.terms<-subset(cpDNAterms,cpDNAterms$Locus=="matK")
#Create a vecotr of NA so we can merge with the search terms for introns and exons
cds.terms<-cbind(cds.terms,(rep(NA,length(cds.terms$Locus))))
colnames(cds.terms)[4]<-"IntronExonNumber"

#Prepare a search term table for the intron and exons to remove
#We can start with the cpDNAterms for trnK
IntronExon.terms<-subset(cpDNAterms,cpDNAterms$Locus=="trnK")
#As we want to go for two exons, we will want the synonyms repeated as we are doing and intron
#and an exon
IntronExon.terms<-rbind(IntronExon.terms,IntronExon.terms)#duplicate the terms
#rep the sequence type we want to extract
IntronExon.terms$Type<-rep(c("Intron","Intron","Exon","Exon"))
IntronExon.terms$Locus<-rep(c("trnK_Intron","trnK_Exon2"),each=2)
IntronExon.terms<-cbind(IntronExon.terms,rep(c(1,1,2,2)))#Add intron/exon number info
#change column name for number info for IntronExon name
colnames(IntronExon.terms)[4]<-"IntronExonNumber"
#We can then merge everything together with MergeSearchTerms terms
IntronExonExampleTerms<-MergeSearchTerms(IntronExon.terms,cds.terms)

#Run AnnotationBust
IntronExon.example<-AnnotationBust(Accessions=c("KX687911.1", "KX687910.1"),
Terms=IntronExonExampleTerms, Prefix="DemoIntronExon")

## End(Not run)

[Package AnnotationBustR version 1.3.0 Index]