R: Preparing FASTA files for pan-genomics

panPrep {micropan}

R Documentation

Preparing FASTA files for pan-genomics

Description

Preparing a FASTA file before starting comparisons of sequences.

Usage

panPrep(in.file, genome_id, out.file, protein = TRUE, min.length = 10, discard = "")

Arguments

`in.file`	The name of a FASTA formatted file with protein or nucleotide sequences for coding genes in a genome.
`genome_id`	The Genome Identifier, see below.
`out.file`	Name of file where the prepared sequences will be written.
`protein`	Logical, indicating if the ‘in.file’ contains protein (`TRUE`) or nucleotide (`FALSE`) sequences.
`min.length`	Minimum sequence length; sequences shorter than this are removed.
`discard`	A text containing a regular expression; sequences whose headerline matches it are discarded.

Details

This function will read the in.file and produce another, slightly modified, FASTA file which is prepared for the comparisons using blastpAllAll, hmmerScan or any other method.

The main purpose of panPrep is to make certain every sequence is labeled with a tag called a ‘genome_id’ identifying the genome from which it comes. This tag consists of the text “GID” followed by an integer. Any integer may be used, as long as it is unique to each genome in the study. If a genome has the identifier “GID12345”, then the sequences in the file produced by panPrep will have headerlines starting with “GID12345_seq1”, “GID12345_seq2”, “GID12345_seq3”, etc. This makes it possible to quickly identify which genome every sequence belongs to.
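As a rough illustration of this labeling scheme, the base R sketch below builds unique genome_id texts from integers and recovers the genome from headerline tokens. The integers and header tokens are made up for illustration, and the regular expression assumes each token starts with the genome_id followed by `_seq` and a number, as described above.

```r
# Hypothetical integers, one per genome in the study (they must be unique)
genome.numbers <- c(1L, 2L, 12345L)
genome_ids <- paste0("GID", genome.numbers)
stopifnot(!any(duplicated(genome_ids)))

# Headerline tokens of the form described above (illustrative values only)
headers <- c("GID12345_seq1", "GID12345_seq2", "GID2_seq1")

# Recover the genome_id from the start of each headerline
gid <- sub("^(GID[0-9]+)_seq[0-9]+.*$", "\\1", headers)

# Count the sequences per genome
table(gid)
```

Using integer values (note the `L` suffix) avoids large numbers being pasted in scientific notation.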

The ‘genome_id’ is also added to the file name specified in ‘out.file’. For this reason the ‘out.file’ must have a file extension containing letters only. By convention, we expect FASTA files to have one of the extensions ‘.fsa’, ‘.faa’, ‘.fa’ or ‘.fasta’.
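A minimal sketch of why a letters-only extension matters, assuming that the genome_id is inserted just before the extension and separated by an underscore. The text above does not confirm this placement. The helper `add_gid` is hypothetical and not part of micropan.

```r
# Hypothetical helper: insert the genome_id before a letters-only extension.
# The "_" separator and the exact placement are assumptions.
add_gid <- function(out.file, genome_id) {
  # Find a trailing extension made of a dot followed by letters only
  ext <- regmatches(out.file, regexpr("\\.[a-zA-Z]+$", out.file))
  if (length(ext) == 0) stop("out.file needs a file extension with letters only")
  stem <- substr(out.file, 1, nchar(out.file) - nchar(ext))
  paste0(stem, "_", genome_id, ext)
}

add_gid("genome.faa", "GID123")   # "genome_GID123.faa"
```

With an extension such as ‘.fa2’ the pattern finds no match, so the sketch stops with an error instead of guessing where the identifier should go.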

panPrep will also remove sequences shorter than min.length, remove stop codon symbols (‘*’), replace alien characters with ‘X’ and convert all sequences to upper-case. If the input ‘discard’ contains a regular expression, any sequences having a match to this in their headerline are also removed. Example: If we use the prodigal software (see findGenes) to find proteins in a genome, partially predicted genes will have the text ‘partial=10’ or ‘partial=01’ in their headerline. Using ‘discard = "partial=01|partial=10"’ will remove these from the data set.
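The cleaning steps above can be sketched in base R on plain character vectors. This is only an illustration, not the package's implementation: the function name `clean_sequences`, the order of the steps, and the set of characters treated as valid protein symbols are all assumptions.

```r
# Hypothetical sketch of the cleaning steps (not micropan's actual code)
clean_sequences <- function(header, sequence, min.length = 10, discard = "") {
  sequence <- toupper(sequence)                       # convert to upper-case
  sequence <- gsub("*", "", sequence, fixed = TRUE)   # drop stop codon symbols
  # Assumed protein alphabet; any other character counts as "alien"
  sequence <- gsub("[^ACDEFGHIKLMNPQRSTVWY]", "X", sequence)
  keep <- nchar(sequence) >= min.length               # length filter
  if (nzchar(discard)) keep <- keep & !grepl(discard, header)
  data.frame(Header = header[keep], Sequence = sequence[keep],
             stringsAsFactors = FALSE)
}

# Made-up headerlines and sequences for illustration
hdr <- c("gene1 partial=00", "gene2 partial=10", "gene3 partial=00")
sq  <- c("mkvlaagg?rplk*", "MKVLAAGGTRPLKW", "MKV*")
cleaned <- clean_sequences(hdr, sq, discard = "partial=01|partial=10")
cleaned
```

Here only the first sequence survives: the second is dropped because its headerline matches ‘discard’, and the third because it is shorter than min.length once the stop symbol is removed.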

Value

This function produces a FASTA formatted sequence file, and returns the name of this file.

Author(s)

Lars Snipen and Kristian Liland.

Examples

# Using a protein file in this package
# We need to uncompress it first...
pf <- file.path(path.package("micropan"),"extdata","xmpl.faa.xz")
prot.file <- tempfile(fileext = ".xz")
ok <- file.copy(from = pf, to = prot.file)
prot.file <- xzuncompress(prot.file)

# Prepping it, using the genome_id "GID123"
prepped.file <- panPrep(prot.file, genome_id = "GID123", out.file = tempfile(fileext = ".faa"))

# Reading the prepped file
prepped <- readFasta(prepped.file)
head(prepped)

# ...and cleaning...
ok <- file.remove(prot.file, prepped.file)

[Package micropan version 2.1 Index]