sq-class {tidysq} | R Documentation |
sq: class for keeping biological sequences tidy
Description
An object of class sq represents a list of biological sequences. It is the main internal format of the tidysq package and most functions operate on it. The storage method is memory-optimized so that objects require as little memory as possible (details below).
Construction/reading/import of sq objects
There are multiple ways of obtaining sq objects:
- constructing from another object with the as.sq method,
- reading from a FASTA file with read_fasta,
- importing from a format of another package, like ape or Biostrings, with import_sq.
Important note: Manually assigning the class sq to an object is strongly discouraged. Because low-level functions are used for bit packing, such an assignment may lead to one of those functions being called while operating on the object, or even while printing it, which can crash the R session and, in consequence, cause loss of data.
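The first construction route can be sketched as follows, assuming the tidysq package is installed. The example sequences and the use of a type name in the alphabet argument are illustrative assumptions, not taken from this page.

```r
library(tidysq)

# Illustrative input: two short DNA sequences as plain strings
raw_sequences <- c("ACGTAC", "TTGACA")

# Construct an sq object, stating the type explicitly
# (assumption: the alphabet argument accepts a type name such as "dna_bsc")
sq_dna <- sq(raw_sequences, alphabet = "dna_bsc")

# as.sq converts another object into sq; here the type is left to be guessed
sq_guessed <- as.sq(raw_sequences)

print(sq_dna)
```

Going through sq or as.sq, rather than setting the class by hand, keeps the bit-packed storage consistent with the class attribute.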
Export/writing of sq objects
There are multiple ways of saving sq objects or converting them into other formats:
- converting into a character vector with the as.character method,
- converting into a character matrix with the as.matrix method,
- saving as a FASTA file with write_fasta,
- exporting into a format of another package, like ape or Biostrings, with export_sq.
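A minimal sketch of the conversion and FASTA routes listed above, assuming tidysq is installed. The sequence names and the temporary file path are made up for illustration, and the argument names passed to write_fasta are an assumption about its signature.

```r
library(tidysq)

sq_dna <- sq(c("ACGTAC", "TTGACA"), alphabet = "dna_bsc")

# Back to plain strings, one element per sequence
as_strings <- as.character(sq_dna)

# Character matrix with one letter per cell
as_letters <- as.matrix(sq_dna)

# Write to a temporary FASTA file and read it back
# (assumption: write_fasta takes the sq object, a vector of names and a path)
fasta_path <- tempfile(fileext = ".fasta")
write_fasta(sq_dna, name = c("seq_1", "seq_2"), file = fasta_path)
restored <- read_fasta(fasta_path, alphabet = "dna_bsc")
```

export_sq is left out of the sketch because it needs the target package (for example ape or Biostrings) to be installed as well.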
Ambiguous letters
This package is meant to handle amino acid, DNA and RNA sequences. The IUPAC standard for one-letter codes includes ambiguous bases, each of which stands for more than one standard base. For example, "B" in the context of DNA code means "any of C, G or T". As there are operations that make sense only for unambiguous bases (like translate), this package has separate types for sequences with "basic" and "extended" alphabets.
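The split between basic and extended types can be sketched as follows, assuming tidysq is installed. The sequences are illustrative only.

```r
library(tidysq)

# "B" is ambiguous in DNA, so this object needs the extended alphabet
sq_ext <- sq(c("ACGT", "ACBT"), alphabet = "dna_ext")
sq_type(sq_ext)

# remove_ambiguous returns an object of the basic type; with default
# settings, sequences that contain ambiguous letters become empty
sq_bsc <- remove_ambiguous(sq_ext)
sq_type(sq_bsc)
```

After this step the object is of a basic type, which is what operations such as translate expect.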
Types of sq
There is a need to differentiate sq objects that keep different types of sequences (DNA, RNA, amino acid), as they use different alphabets. Furthermore, there are special types for handling non-standard sequence formats.
Each sq object has exactly one of the following types:
- ami_bsc - (amino acids) represents a list of sequences of amino acids (peptides or proteins),
- ami_ext - same as above, but with possible usage of ambiguous letters,
- dna_bsc - (DNA) represents a list of DNA sequences,
- dna_ext - same as above, but with possible usage of ambiguous letters,
- rna_bsc - (RNA) represents a list of RNA sequences (together with DNA above often collectively called "nucleotide sequences"),
- rna_ext - same as above, but with possible usage of ambiguous letters,
- unt - (untyped) represents a list of sequences that do not have a specified type. They are mainly the result of reading sequences from a file that contains some letters that are not in the standard nucleotide or amino acid alphabets and that the user has not specified explicitly. They should be converted to other sq types (using functions like substitute_letters or typify),
- atp - (atypical) represents sequences that have an alphabet different from the standard alphabets - similarly to unt, but the user has been explicitly informed about it. They are the result of constructing sequences or reading from a file with a provided custom alphabet (for details see read_fasta and the sq function). They are also the result of using the substitute_letters function - users can use it, for example, to simplify an alphabet by replacing several letters with one.
For clarity, ami_bsc and ami_ext types are often referred to collectively as ami when there is no need to explicitly specify every possible type. The same applies to dna and rna.
The type of an sq object is printed by the overloaded print method. It can also be checked and obtained as a value (that may be passed as an argument to a function) by using sq_type.
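A short sketch of checking types with sq_type, assuming tidysq is installed. The sequences are illustrative, and passing a vector of letters as a custom alphabet is an assumption based on the description of the atp type above.

```r
library(tidysq)

# Basic amino acid sequences, type stated explicitly
sq_ami <- sq(c("MKVLA", "GHTRR"), alphabet = "ami_bsc")

# Assumption: a character vector of letters is treated as a custom alphabet
sq_custom <- sq(c("ABBA", "BAAB"), alphabet = c("A", "B"))

sq_type(sq_ami)
sq_type(sq_custom)

# The returned value can be used in ordinary control flow
if (sq_type(sq_ami) == "ami_bsc") {
  message("basic amino acid sequences")
}
```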
Alphabet
See alphabet.
The user can obtain the alphabet of an sq object using the alphabet function. The user can check which letters are invalid (i.e. not represented in the standard amino acid or nucleotide alphabet) in each sequence of a given sq object by using find_invalid_letters. To substitute one letter with another, use substitute_letters.
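The three functions named above can be sketched together as follows, assuming tidysq is installed. The letter "X" as a stand-in for a non-standard letter and the replacement chosen for it are illustrative assumptions.

```r
library(tidysq)

sq_dna <- sq(c("ACGT", "TTGA"), alphabet = "dna_bsc")
alphabet(sq_dna)

# Atypical object whose custom alphabet contains the extra letter "X"
sq_atp <- sq(c("ACXT", "TTGA"), alphabet = c("A", "C", "G", "T", "X"))

# Letters of each sequence that are not part of the dna_bsc alphabet
find_invalid_letters(sq_atp, "dna_bsc")

# Replace "X" with "A"; the encoding is a named vector
# (name = letter to replace, value = replacement)
sq_fixed <- substitute_letters(sq_atp, c(X = "A"))
as.character(sq_fixed)
```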
Missing/Not Available values
It is possible to introduce NA values into sequences. An NA value does not represent a gap (which is represented by "-") or a wildcard element ("N" in the case of nucleotides and "X" in the case of amino acids), but is used as a representation of an empty position or an invalid letter (one not represented in the nucleotide or amino acid alphabet).
NA does not belong to any alphabet. It is printed as "!" and, thus, using "!" as a special letter in atp sequences is strongly discouraged (but the print character can be changed in options, see tidysq-options).
NA might be introduced by:
- reading a FASTA file with non-standard letters with read_fasta with the safe_mode argument set to TRUE,
- replacing a letter with an NA value with substitute_letters,
- subsetting sequences beyond their lengths with bite.
The user can convert sequences that contain NA values into NULL sequences with remove_na.
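As a sketch of the last route in the list above and of remove_na, assuming tidysq is installed (the sequences and indices are illustrative):

```r
library(tidysq)

sq_dna <- sq(c("ACGT", "TTGACA"), alphabet = "dna_bsc")

# Positions 5 and 6 do not exist in the first sequence, so bite fills
# them with NA (a warning about this is to be expected)
sq_bitten <- bite(sq_dna, 3:6)
print(sq_bitten)

# With default settings, sequences containing NA become empty sequences
sq_clean <- remove_na(sq_bitten)
get_sq_lengths(sq_clean)
```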
NULL (empty) sequences
A NULL sequence is a sequence of length 0.
NULL sequences might be introduced by:
- constructing an sq object from a character string of length zero,
- using the remove_ambiguous function,
- using the remove_na function,
- subsetting an sq object with the bite function using negative indices that span at least -1:-length(sequence).
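Two of the routes above can be sketched as follows, assuming tidysq is installed. get_sq_lengths is used here only to show the resulting lengths.

```r
library(tidysq)

# An empty string yields a NULL (zero-length) sequence
sq_dna <- sq(c("ACGT", ""), alphabet = "dna_bsc")
get_sq_lengths(sq_dna)

# Dropping every position of a four-letter sequence leaves it empty
sq_single <- sq("ACGT", alphabet = "dna_bsc")
sq_emptied <- bite(sq_single, -1:-4)
get_sq_lengths(sq_emptied)
```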
Storage format
An sq object is, in fact, a list of raw vectors. The fact that it is a list implies that the user can concatenate sq objects using the c method and subset them using the extract operator. The alphabet is kept as an attribute of the object.
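Concatenation and subsetting can be sketched as follows, assuming tidysq is installed and both objects share the same type (the sequences are illustrative):

```r
library(tidysq)

sq_first <- sq(c("ACGT", "TTGA"), alphabet = "dna_bsc")
sq_second <- sq("GGCC", alphabet = "dna_bsc")

# Concatenate two sq objects into one
sq_all <- c(sq_first, sq_second)
length(sq_all)

# Subset with the extract operator, as with any list
sq_subset <- sq_all[2:3]
as.character(sq_subset)

# The alphabet travels with the object
alphabet(sq_subset)
```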
Raw vectors are the most efficient way of storage - each letter of a sequence is assigned an integer (its index in the alphabet of the sq object). Those integers in binary format fit in fewer than 8 bits, but would normally take up at least a whole byte each. However, thanks to bit packing it is possible to remove unused bits and store the numbers more tightly. This means that all operations must either be implemented with this packing in mind or accept a little time overhead induced by unpacking and repacking sequences. However, this cost is relatively low in comparison to the amount of saved memory.
For example, the dna_bsc alphabet consists of 5 values: ACGT-. They are assigned numbers 0 to 4 respectively. Those numbers in binary format take the form: 000, 001, 010, 011, 100. Each of these letters can be coded with just 3 bits instead of the 8 demanded by char - this allows us to save more than 60% of the memory spent on storage of basic nucleotide sequences.
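The 3-bit idea can be illustrated in base R alone. This is a sketch of the general technique only: the helper names pack_sequence and unpack_sequence are hypothetical, and the bit order chosen here is an assumption, so it should not be read as the package's internal layout.

```r
# dna_bsc letters in index order 0..4, as described above
alphabet_dna <- c("A", "C", "G", "T", "-")
bits_per_letter <- 3L

# Hypothetical helper: pack a string into a raw vector, 3 bits per letter
pack_sequence <- function(sequence) {
  letters_vec <- strsplit(sequence, "")[[1]]
  codes <- match(letters_vec, alphabet_dna) - 1L  # 0-based alphabet index
  # Keep the 3 lowest bits of every code, least significant bit first
  bits <- unlist(lapply(codes, function(code) {
    intToBits(code)[seq_len(bits_per_letter)]
  }))
  # Pad with zero bits so the total is a multiple of 8
  padding <- (8L - length(bits) %% 8L) %% 8L
  packBits(c(bits, raw(padding)), type = "raw")
}

# Hypothetical helper: reverse the packing, given the original length
unpack_sequence <- function(packed, n_letters) {
  bits <- as.integer(rawToBits(packed))[seq_len(n_letters * bits_per_letter)]
  bit_matrix <- matrix(bits, nrow = bits_per_letter)
  codes <- as.integer(colSums(bit_matrix * 2^(0:2)))
  paste(alphabet_dna[codes + 1L], collapse = "")
}

packed <- pack_sequence("ACGTACGT")
length(packed)  # 8 letters * 3 bits = 24 bits = 3 bytes
unpack_sequence(packed, 8L)
```

Eight letters fit in 3 bytes instead of 8, which matches the saving of more than 60% mentioned above. The original length has to be kept alongside the packed bytes, because padding bits are otherwise indistinguishable from letters with index 0.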
tibble compatibility
sq objects are compatible with the tibble class - that means one can have an sq object as a column of a tibble. There are overloaded print methods, so that it is printed in a pretty format.
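A minimal sketch of an sq column inside a tibble, assuming both tidysq and tibble are installed (the column names and sequences are illustrative):

```r
library(tidysq)
library(tibble)

# One row per sequence: a name column and an sq column
sequences <- tibble(
  name = c("seq_1", "seq_2"),
  sq = sq(c("ACGT", "TTGA"), alphabet = "dna_bsc")
)

print(sequences)
sq_type(sequences$sq)
```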