| sq-class {tidysq} | R Documentation |
sq: class for keeping biological sequences tidy
Description
An object of class sq represents a list of biological sequences. It is the main internal format of the tidysq package and most functions operate on it. The storage method is memory-optimized so that objects require as little memory as possible (details below).
Construction/reading/import of sq objects
There are multiple ways of obtaining sq objects:
- constructing from another object with the as.sq method,
- reading from a FASTA file with read_fasta,
- importing from a format of another package, like ape or Biostrings, with import_sq.
Important note: Manually assigning the class sq to an object is strongly
discouraged. The package relies on low-level functions for bit packing, and
such an object may be passed to one of those functions during ordinary
operations, or even when printing it, which can crash the R session and,
in consequence, cause loss of data.
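The three construction routes can be sketched as follows. This is a minimal sketch that assumes tidysq is installed and that sq() accepts an alphabet argument naming the type, as in current releases; the sequences and the temporary file are made up for illustration.

```r
library(tidysq)

# Construct directly from a character vector, naming the type explicitly.
sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# Convert an existing character vector; the type is guessed from the letters.
sq_ami <- as.sq(c("MKLV", "AYW"))

# Read from a FASTA file (written here to a temporary location first).
fasta_path <- tempfile(fileext = ".fasta")
writeLines(c(">seq1", "ACGT", ">seq2", "TTGACA"), fasta_path)
fasta_tbl <- read_fasta(fasta_path)
```

import_sq is left out because it needs an object from another package, such as ape or Biostrings, as input.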
Export/writing of sq objects
There are multiple ways of saving sq objects or converting them into
other formats:
- converting into a character vector with the as.character method,
- converting into a character matrix with the as.matrix method,
- saving as a FASTA file with write_fasta,
- exporting into a format of another package, like ape or Biostrings, with export_sq.
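A short sketch of the conversion routes listed above, assuming tidysq is installed and that write_fasta takes the sequences, their names and a file path in that order. The names seq1 and seq2 and the temporary path are illustrative only.

```r
library(tidysq)

sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# Back to plain strings, one element per sequence.
seq_strings <- as.character(sq_dna)

# One row per sequence, one column per position.
seq_matrix <- as.matrix(sq_dna)

# Save to a FASTA file and inspect the written lines.
out_path <- tempfile(fileext = ".fasta")
write_fasta(sq_dna, name = c("seq1", "seq2"), file = out_path)
readLines(out_path)
```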
Ambiguous letters
This package is meant to handle amino acid, DNA and RNA sequences. The IUPAC
standard for one-letter codes includes ambiguous letters, each of which
stands for more than one standard base. For example, "B" in the
context of DNA code means "any of C, G or T". As there are operations that
make sense only for unambiguous bases (like translate), this
package has separate types for sequences with "basic" and "extended"
alphabets.
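The idea of ambiguity can be illustrated in base R, independently of tidysq. The table below lists the IUPAC DNA codes; is_unambiguous is a hypothetical helper written for this sketch, not a package function.

```r
# IUPAC DNA codes: each letter maps to the standard bases it may stand for.
iupac_dna <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: TRUE when every letter denotes exactly one base.
is_unambiguous <- function(sequence) {
  letters_in_seq <- strsplit(sequence, "")[[1]]
  all(lengths(iupac_dna[letters_in_seq]) == 1)
}

iupac_dna[["B"]]        # "C" "G" "T"
is_unambiguous("ACGT")  # TRUE
is_unambiguous("ACBT")  # FALSE
```

A sequence for which such a check fails would need an "extended" type, while one that passes fits a "basic" type.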
Types of sq
There is a need to differentiate sq objects that keep different types
of sequences (DNA, RNA, amino acid), as they use different alphabets.
Furthermore, there are special types for handling non-standard sequence
formats.
Each sq object has exactly one of the following types:
- ami_bsc: (amino acids) represents a list of sequences of amino acids (peptides or proteins),
- ami_ext: same as above, but with possible usage of ambiguous letters,
- dna_bsc: (DNA) represents a list of DNA sequences,
- dna_ext: same as above, but with possible usage of ambiguous letters,
- rna_bsc: (RNA) represents a list of RNA sequences (together with DNA above often collectively called "nucleotide sequences"),
- rna_ext: same as above, but with possible usage of ambiguous letters,
- unt: (untyped) represents a list of sequences that do not have a specified type. They are mainly the result of reading sequences from a file that contains some letters that are not in the standard nucleotide or amino acid alphabets and that the user has not specified explicitly. They should be converted to other sq types (using functions like substitute_letters or typify),
- atp: (atypical) represents sequences that have an alphabet different from the standard alphabets, similarly to unt, but the user has been explicitly informed about it. They are the result of constructing sequences or reading from a file with a provided custom alphabet (for details see read_fasta and the sq function). They are also the result of using the substitute_letters function, which can be used, for example, to simplify an alphabet by replacing several letters with one.
For clarity, ami_bsc and ami_ext types are often referred to collectively as ami when there is no need to explicitly specify every possible type. The same applies to dna and rna.
The type of an sq object is shown when the object is printed with the
overloaded print method. It can also be checked and obtained as
a value (which may be passed as an argument to a function) by using
sq_type.
Alphabet
See alphabet.
The user can obtain the alphabet of an sq object using the
alphabet function. The user can check which letters are
invalid (i.e. not represented in the standard amino acid or nucleotide
alphabets) in each sequence of a given sq object by using
find_invalid_letters. To substitute one letter with another,
use substitute_letters.
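As a brief illustration, the type and the alphabet of an object can be queried like this. The sketch assumes tidysq is installed and that sq_type returns the type as a character string, so that it can be reused as the alphabet argument of sq().

```r
library(tidysq)

sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# The type as a value.
dna_type <- sq_type(sq_dna)

# The letters available to sequences of this object.
alphabet(sq_dna)

# Reuse the obtained type when constructing another object.
sq_more <- sq("GGCA", alphabet = dna_type)
```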
Missing/Not Available values
It is possible to introduce NA values into
sequences. An NA value does not represent a gap (which is represented by
"-") or a wildcard element ("N" in the case of nucleotides and
"X" in the case of amino acids), but is used as a representation of
an empty position or an invalid letter (one not represented in the nucleotide
or amino acid alphabet).
NA does not belong to any alphabet. It is printed as "!" and,
thus, using "!" as a special letter in atp sequences is strongly
discouraged (but the print character can be changed in options, see
tidysq-options).
NA might be introduced by:
- reading a FASTA file with non-standard letters with read_fasta with the safe_mode argument set to TRUE,
- replacing a letter with an NA value with substitute_letters,
- subsetting sequences beyond their lengths with bite.
The user can convert sequences that contain NA values into
NULL sequences with remove_na.
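The subsetting case can be sketched as below. This assumes tidysq is installed, that bite takes the indices as its second argument, and that remove_na has a by_letter argument, as in current releases. A warning about the overrun is likely when bite is called.

```r
library(tidysq)

sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# Positions 5 and 6 do not exist in the first sequence,
# so they are expected to come back as NA.
bitten <- bite(sq_dna, 1:6)
bitten

# Turn every sequence that contains NA into a NULL sequence.
without_na <- remove_na(bitten)

# Assumed alternative: drop only the NA positions themselves.
trimmed <- remove_na(bitten, by_letter = TRUE)
```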
NULL (empty) sequences
A NULL sequence is a sequence of length 0.
NULL sequences might be introduced by:
- constructing an sq object from a character string of length zero,
- using the remove_ambiguous function,
- using the remove_na function,
- subsetting an sq object with the bite function (and negative indices that span at least -1:-length(sequence)).
Storage format
An sq object is, in fact, a list of raw vectors. Because it
is a list, the user can concatenate sq objects using the
c method and subset them using the
extract operator. The alphabet is kept as an
attribute of the object.
Raw vectors are the most efficient way of storage - each letter of a
sequence is assigned an integer (its index in the alphabet of the sq object).
Those integers in binary format fit in less than 8 bits, but normally are
stored on 16 bits. Thanks to bit packing, it is possible to remove the
unused bits and store the numbers more tightly. This means that all operations
must either be implemented with this packing in mind or accept a small time
overhead induced by unpacking and repacking sequences. This cost
is relatively low in comparison to the amount of memory saved.
For example, the dna_bsc alphabet consists of 5 values: A, C, G, T and -.
They are assigned numbers 0 to 4 respectively. Those numbers in binary format
take the form: 000, 001, 010, 011, 100. Each
of these letters can be coded with just 3 bits instead of the 8 demanded
by char. The packed form takes 3/8 = 37.5% of the original size, which
saves more than 60% of the memory spent on storage of basic nucleotide
sequences.
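The packing idea can be sketched in base R. This is only an illustration of storing 3-bit codes back to back in a raw vector. The helpers pack_sequence and unpack_sequence are hypothetical, and the bit order used here is not claimed to match the package's internal layout.

```r
# Alphabet of the example; codes are zero-based positions in it.
dna_alphabet <- c("A", "C", "G", "T", "-")
bits_per_letter <- 3L

# Hypothetical helper: pack a string into a raw vector, 3 bits per letter.
pack_sequence <- function(sequence) {
  codes <- match(strsplit(sequence, "")[[1]], dna_alphabet) - 1L
  # intToBits() gives 32 bits per integer, least significant first;
  # only the lowest 3 bits of each code are kept.
  bits <- unlist(lapply(codes, function(code) {
    intToBits(code)[seq_len(bits_per_letter)]
  }))
  # packBits() needs a whole number of bytes, so pad with zero bits.
  padding <- (8L - length(bits) %% 8L) %% 8L
  packBits(c(bits, raw(padding)), type = "raw")
}

# Hypothetical helper: recover the string from the packed bytes.
unpack_sequence <- function(packed, n_letters) {
  bits <- as.integer(rawToBits(packed))[seq_len(n_letters * bits_per_letter)]
  # One column per letter; weight the three bits by 1, 2 and 4.
  codes <- colSums(matrix(bits, nrow = bits_per_letter) * c(1L, 2L, 4L))
  paste(dna_alphabet[codes + 1L], collapse = "")
}

packed <- pack_sequence("ACGT-ACG")
length(packed)               # 3 bytes instead of 8
unpack_sequence(packed, 8L)  # "ACGT-ACG"
```

The number of letters has to be kept alongside the packed bytes, because trailing padding bits are otherwise indistinguishable from the code 000.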
tibble compatibility
sq objects are compatible with the tibble class, which
means one can have an sq object as a column of a tibble.
There are overloaded print methods, so such a column is printed in a readable format.
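A minimal sketch of an sq column, assuming both tidysq and tibble are installed. The column names name and sq are chosen for illustration.

```r
library(tidysq)
library(tibble)

sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# Keep the sequences next to their identifiers in one table.
seq_tbl <- tibble(name = c("seq1", "seq2"), sq = sq_dna)
seq_tbl
```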