read_fasta {canprot} | R Documentation |
Functions for reading FASTA files
Description
Read protein amino acid composition or sequences from a file and count numbers of amino acids in given sequences.
Usage
read_fasta(file, iseq = NULL, type = "count", lines = NULL,
ihead = NULL, start = NULL, stop = NULL, molecule = "protein", id = NULL)
count_aa(sequence, start = NULL, stop = NULL, molecule = "protein")
sum_aa(AAcomp, abundance = 1, average = FALSE)
Arguments
file |
character, path to FASTA file |
iseq |
numeric, which sequences to read from the file |
type |
character, type of return value (‘count’, ‘sequence’, ‘lines’, or ‘headers’) |
lines |
list of character, supply the lines here instead of reading them from file |
ihead |
numeric, which lines are headers |
start |
numeric, position in sequence to start counting |
stop |
numeric, position in sequence to stop counting |
molecule |
character, type of molecule (‘protein’, ‘DNA’, or ‘RNA’) |
id |
character, value to be used for the protein column in the output |
sequence |
character, one or more sequences |
AAcomp |
data frame, amino acid composition(s) of proteins |
abundance |
numeric, abundances of proteins |
average |
logical, return the weighted average of amino acid counts? |
Details
read_fasta is used to retrieve entries from a FASTA file.
Use iseq to select the sequences to read (the default is all sequences).
The function returns various data formats depending on the value of type:
- ‘count’: data frame of amino acid counts
- ‘sequence’: list of sequences
- ‘lines’: list of lines from the FASTA file (including headers)
- ‘headers’: list of header lines from the FASTA file
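To make the difference between the return types concrete, here is a minimal base-R sketch of how lines, header positions, and sequences relate in a FASTA file. It is not the canprot implementation; the vector fasta_lines and its contents are made up for illustration.

```r
# Hypothetical FASTA content, one element per line (not from a real file)
fasta_lines <- c(">prot1 first example", "MKT", "AYI",
                 ">prot2 second example", "GGSGG")

# Header lines start with ">"; these positions play the role of 'ihead'
ihead <- grep("^>", fasta_lines)
headers <- fasta_lines[ihead]

# Each sequence spans the lines between one header and the next
first <- ihead + 1
last <- c(ihead[-1] - 1, length(fasta_lines))
sequences <- lapply(seq_along(ihead), function(i) {
  paste(fasta_lines[first[i]:last[i]], collapse = "")
})

print(headers)
print(sequences)
```

In this sketch, fasta_lines corresponds to what ‘lines’ returns, headers to ‘headers’, and sequences to ‘sequence’.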
When type is ‘count’, the header lines of the file are parsed to obtain protein names that are put into the protein column in the result.
Furthermore, if a UniProt FASTA header is detected (using the regular expression "\|......\|.*_"), the information there (accession, name, organism) is split into the protein, abbrv, and organism columns of the resulting data frame.
This behavior (which may take a while for large files) can be suppressed by supplying protein names in id.
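The header detection described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the header string is an illustrative UniProt-style example, the splitting logic is a guess at one reasonable approach rather than the package's own code, and the mapping of accession, name, and organism onto the protein, abbrv, and organism columns follows the order given in the text.

```r
# Illustrative UniProt-style header (assumed for this example)
header <- ">sp|P12345|AATM_RABIT Aspartate aminotransferase, mitochondrial"

# The regular expression quoted in the text, with backslashes escaped for R
pattern <- "\\|......\\|.*_"
is_uniprot <- grepl(pattern, header)

# Split on "|" to get the accession, then on " " and "_" for name and organism
fields <- strsplit(header, "|", fixed = TRUE)[[1]]
entry <- strsplit(fields[3], " ", fixed = TRUE)[[1]][1]
name_org <- strsplit(entry, "_", fixed = TRUE)[[1]]

parsed <- data.frame(protein = fields[2],
                     abbrv = name_org[1],
                     organism = name_org[2])
print(parsed)
```

Running this kind of parsing on every header is the step that supplying id avoids.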
To speed up processing, if the line numbers of the header lines were previously determined, they can be supplied in ihead.
Optionally, the lines of a previously read file may be supplied in lines (in this case no file is needed, so file should be set to "").
count_aa is the underlying function that counts the numbers of each amino acid or nucleic-acid base in one or more sequences.
The matching of letters is case-insensitive.
A message is generated if any character in sequence, excluding spaces, is not one of the single-letter amino acid or nucleobase abbreviations.
start and/or stop can be provided to process a fragment of the sequence.
If only one of start or stop is present, the other defaults to 1 (start) or the length of the respective sequence (stop).
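The counting rules above (case-insensitive matching, a message for unrecognized characters, and start/stop defaults) can be illustrated with a small base-R sketch. The function sketch_count is hypothetical and handles only a single protein sequence; it is not the canprot implementation, and the way it treats spaces relative to start and stop is an assumption.

```r
# One-letter codes and matching three-letter column names, in the documented order
aa1 <- c("A", "C", "D", "E", "F", "G", "H", "I", "K", "L",
         "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y")
aa3 <- c("Ala", "Cys", "Asp", "Glu", "Phe", "Gly", "His", "Ile", "Lys", "Leu",
         "Met", "Asn", "Pro", "Gln", "Arg", "Ser", "Thr", "Val", "Trp", "Tyr")

# Hypothetical helper, for illustration only
sketch_count <- function(sequence, start = NULL, stop = NULL) {
  # A missing start defaults to 1, a missing stop to the sequence length
  if (is.null(start)) start <- 1
  if (is.null(stop)) stop <- nchar(sequence)
  # Upper-case the fragment so that matching is case-insensitive
  chars <- strsplit(toupper(substr(sequence, start, stop)), "")[[1]]
  chars <- chars[chars != " "]
  # Report, but do not count, characters that are not amino acid letters
  unknown <- setdiff(chars, aa1)
  if (length(unknown) > 0) {
    message("unrecognized characters: ", paste(unknown, collapse = " "))
  }
  counts <- vapply(aa1, function(letter) sum(chars == letter), numeric(1))
  names(counts) <- aa3
  as.data.frame(as.list(counts))
}

print(sketch_count("ggsGG"))
print(sketch_count("GGSGGAAA", start = 6))
```

The second call supplies only start, so the fragment runs from position 6 to the end of the sequence.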
sum_aa sums the amino acid compositions in the input AAcomp data frame.
It only applies to columns with the three-letter abbreviations of amino acids and to a column named chains (if present).
The values in these columns are multiplied by the indicated abundance after recycling to the number of proteins.
The values in these columns are then summed; if average is TRUE then the sum is divided by the number of proteins.
Proteins with missing values (NA) of amino acid composition or abundance are omitted from the calculation.
The output has one row and the same number of columns as the input; the value in the non-amino acid columns is taken from the first row of the input.
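The arithmetic described above can be traced with a small base-R sketch. The data frame, its two amino acid columns, and the abundance values are made up for illustration; the sketch follows the description in the text (multiply, sum, optionally divide by the number of proteins) and does not handle missing values, so it should not be read as the package's code.

```r
# Hypothetical compositions of two proteins, restricted to two amino acids
AAcomp <- data.frame(protein = c("P1", "P2"),
                     Ala = c(2, 4),
                     Gly = c(1, 3))
aa_cols <- c("Ala", "Gly")

# Recycle abundance to the number of proteins, then weight each row
abundance <- rep(c(1, 3), length.out = nrow(AAcomp))
weighted <- AAcomp[, aa_cols] * abundance

# Sum over proteins: Ala = 2*1 + 4*3, Gly = 1*1 + 3*3
totals <- colSums(weighted)

# Non-amino acid columns are taken from the first row of the input
total_row <- AAcomp[1, ]
total_row[aa_cols] <- as.list(totals)

# With average = TRUE, the text says the sum is divided by the number of proteins
average_row <- total_row
average_row[aa_cols] <- as.list(totals / nrow(AAcomp))

print(total_row)
print(average_row)
```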
Value
count_aa returns a data frame with these columns (for proteins): Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr.
For ‘DNA’, the columns are changed to A, C, G, T, and for ‘RNA’, the columns are changed to A, C, G, U.
read_fasta returns a list of sequences (for type equal to ‘sequence’) or a list of lines (for type equal to ‘lines’ or ‘headers’).
Otherwise (for type equal to ‘count’), it returns a data frame with these columns: protein, organism, ref, abbrv, chains, and the columns described above for count_aa.
sum_aa returns a one-row data frame.
See Also
Pass the output of read_fasta to add.protein in the CHNOSZ package to set up thermodynamic calculations for proteins.
Examples
## Reading a protein FASTA file
# The path to the file
file <- system.file("extdata/fasta/KHAB17.fasta", package = "canprot")
# Read the sequences, and print the first one
read_fasta(file, type = "sequence")[[1]]
# Count the amino acids in the sequences
aa <- read_fasta(file)
# Calculate protein length (number of amino acids in each protein)
plength(aa)
# Sum the amino acid compositions
sum_aa(aa)
# Count amino acids in a sequence
count_aa("GGSGG")
# A message is issued for unrecognized characters
count_aa("AAAXXX")
# Count nucleobases in a sequence
bases <- count_aa("ACCGGGTTT", molecule = "DNA")