sq {tidysq} | R Documentation |
Construct sq object from character vector
Description
This function allows the user to construct objects of class sq from a character vector.
Usage
sq(
x,
alphabet = NULL,
NA_letter = getOption("tidysq_NA_letter"),
safe_mode = getOption("tidysq_safe_mode"),
on_warning = getOption("tidysq_on_warning"),
ignore_case = FALSE
)
Arguments
x |
[ |
alphabet |
[ |
NA_letter |
[ |
safe_mode |
[ |
on_warning |
[ |
ignore_case |
[ |
Details
Function sq covers all possibilities of standard and non-standard
types and alphabets. What 'type' and 'alphabet' mean exactly is explained in
the sq class documentation. The guide below describes how the function
operates and how it behaves depending on the arguments passed and the letters
in the sequences.
The x parameter should be a character vector. Each element of this vector
is a biological sequence. If this parameter has length 0, an object of class
sq with 0 sequences will be created (if the type is not specified, it will be
dna_bsc, which follows from the rules described below). If it contains
sequences of length 0, NULL sequences will be introduced (see the
NULL (empty) sequences section in the sq class documentation).
Important note: in all cases below, the word 'letter' stands for an
element of an alphabet. A letter might consist of more than one character; for
example, "Ala" might be a single letter. However, to construct or read
sequences with multi-character letters, the user has to specify all letters in
the alphabet parameter. Details on letters, alphabets and types can be found
in the sq class documentation.
Value
An object of class sq with appropriate type.
Simple guide to construct
In many cases, only the x parameter needs to be specified; the type of
sequences will be guessed according to the rules described below. The user
needs to pay attention, however, because for short sequences the type may be
guessed incorrectly. In that case, the type should be specified in the
alphabet parameter.
If your sequences contain non-standard letters, where each non-standard
letter is one character long (that is, any character that is not an uppercase
letter), you also don't need to specify any other parameter. Optionally, you
can do it explicitly by setting alphabet to "unt".
In safe mode it is guaranteed that only letters equal to the NA_letter
argument are interpreted as NA values. Because of that, the resulting alphabet
might differ from the alphabet argument.
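As a small illustration of the note on short sequences, the sketch below builds the same four-letter sequence with and without an explicit type. It assumes that the tidysq package is attached and uses sq_type() and alphabet() to inspect the result; the sequence itself is made up for this example.

```r
library(tidysq)

# A short sequence made only of A, C, G and T is guessed as basic DNA,
# even if a peptide (Ala-Cys-Gly-Thr) was meant.
short_guess <- sq("ACGT")
sq_type(short_guess)

# Passing the type through `alphabet` overrides the guess.
short_ami <- sq("ACGT", alphabet = "ami_bsc")
sq_type(short_ami)

# Inspect the alphabet actually stored in the object.
alphabet(short_ami)
```

Checking sq_type() right after construction is a cheap way to notice a wrong guess before further processing.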
Detailed guide to construct
Below are listed all possibilities that can occur during the construction of
an sq object:
If you don't specify any other parameter than x, function will try to guess sequence type (it will check in exactly this order):
  1. If it contains only ACGT- letters, type will be set to dna_bsc.
  2. If it contains only ACGU- letters, type will be set to rna_bsc.
  3. If it contains any letters from 1. and 2. and additionally letters DEFHIKLMNPQRSVWY*, type will be set to ami_bsc.
  4. If it contains any letters from 1. and additionally letters WSMKRYBDHVN, type will be set to dna_ext.
  5. If it contains any letters from 2. and additionally letters WSMKRYBDHVN, type will be set to rna_ext.
  6. If it contains any letters from previous points and additionally letters JOUXZ, type will be set to ami_ext.
  7. If it contains any letters that exceed all groups mentioned above, type will be set to unt.
If you specify alphabet parameter as any of "dna_bsc", "dna_ext", "rna_bsc", "rna_ext", "ami_bsc", "ami_ext", then:
  If safe_mode is FALSE, then sequences will be built with standard alphabet for given type.
  If safe_mode is TRUE, then sequences will be scanned for letters not in standard alphabet:
    If no such letters are found, then sequences will be built with standard alphabet for given type.
    If at least one such letter is found, then sequences are built with real alphabet and with type set to unt.
If you specify alphabet parameter as "unt", then sequences are scanned for alphabet and subsequently built with obtained alphabet and type unt.
If you specify alphabet parameter as character vector longer than 1, then type is set to atp and alphabet is equal to letters in said parameter.
If ignore_case is set to TRUE, then lowercase letters are
turned into uppercase during their interpretation, unless type is set to
atp.
Handling unt and atp types and NA values
You can convert letters into other ones using substitute_letters and then
use typify or the sq_type<- function to set the type of sq to dna_bsc,
dna_ext, rna_bsc, rna_ext, ami_bsc or ami_ext. If your sequences contain NA
values, use remove_na.
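The workflow above might look as follows. This is a minimal sketch, assuming tidysq is attached; the choice of "X" as the replacement letter and of ami_ext as the target type is arbitrary and made only for illustration.

```r
library(tidysq)

# "%" is not part of any standard alphabet, so the type is guessed as unt.
sq_unt <- sq(c("%%YAPLAA", "PLAA"))
sq_type(sq_unt)

# Replace the non-standard letter with a standard one
# (named vector: name = old letter, value = new letter).
sq_sub <- substitute_letters(sq_unt, c(`%` = "X"))

# Set a standard type whose alphabet covers all remaining letters.
sq_ami <- typify(sq_sub, "ami_ext")
sq_type(sq_ami)

# Without safe mode, letters outside the requested alphabet are read as NA
# (with default options this call may signal a warning).
sq_na <- sq(c("ACGU", "ACGT"), alphabet = "dna_bsc")
remove_na(sq_na)
```

See the documentation of remove_na for how sequences containing NA values are treated.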
See Also
Functions from input module:
import_sq(), random_sq(), read_fasta()
Examples
# constructing sq without specifying alphabet:
# Correct sq type will be guessed from appearing letters
## dna_bsc
sq(c("ATGC", "TCGTTA", "TT--AG"))
## rna_bsc
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"))
## ami_bsc
sq(c("YQQPAVVM", "PQCFL"))
## ami_bsc sq can contain "*" - a letter meaning end of translation:
sq(c("MMDF*", "SYIHR*", "MGG*"))
## dna_ext
sq(c("TMVCCDA", "BASDT-CNN"))
## rna_ext
sq(c("WHDHKYN", "GCYVCYU"))
## ami_ext
sq(c("XYOQWWKCNJLO"))
## unt - assume that one wants to mark some special element in sequence with "%"
sq(c("%%YAPLAA", "PLAA"))
# passing type as alphabet parameter:
# All above examples yield an identical result if type specified is the same as guessed
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_bsc")
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), "rna_bsc")
sq(c("YQQPAVVM", "PQCFL"), "ami_bsc")
sq(c("MMDF*", "SYIHR*", "MGG*"), "ami_bsc")
sq(c("TMVCCDA", "BASDT-CNN"), "dna_ext")
sq(c("WHDHKYN", "GCYVCYU"), "rna_ext")
sq(c("XYOQWWKCNJLO"), "ami_ext")
sq(c("%%YAPLAA", "PLAA"), "unt")
# Type doesn't have to be the same as the guessed one if letters fit in the destination alphabet
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_ext")
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_bsc")
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_ext")
sq(c("ATGC", "TCGTTA", "TT--AG"), "unt")
# constructing sq with specified letters of alphabet:
# In sequences below "mA" denotes methylated alanine - two characters are treated as single letter
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c("mA", LETTERS))
# Order of alphabet letters is not meaningful in most cases
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c(LETTERS, "mA"))
# reading sequences with three-letter names:
sq(c("ProProGlyAlaMetAlaCys"), alphabet = c("Pro", "Gly", "Ala", "Met", "Cys"))
# using safe mode:
# Safe mode guarantees that no element is read as NA
# But resulting alphabet might be different to the passed one (albeit with warning/error)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc", safe_mode = TRUE)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc")
# Safe mode guesses alphabet based on whole sequence
long_sequence <- paste0(paste0(rep("A", 4500), collapse = ""), "N")
sq(long_sequence, safe_mode = TRUE)
sq(long_sequence)
# ignoring case:
# By default, lower- and uppercase letters are treated separately
# This behavior can be changed by setting ignore_case = TRUE
sq(c("aTGc", "tcgTTA", "tt--AG"), ignore_case = TRUE)
sq(c("XYOqwwKCNJLo"), ignore_case = TRUE)
# It is possible to construct sq with length 0
sq(character())
# As well as sq with empty sequences
sq(c("AGTGGC", "", "CATGA", ""))