sq {tidysq}R Documentation

Construct sq object from character vector

Description

This function allows the user to construct objects of class sq from a character vector.

Usage

sq(
  x,
  alphabet = NULL,
  NA_letter = getOption("tidysq_NA_letter"),
  safe_mode = getOption("tidysq_safe_mode"),
  on_warning = getOption("tidysq_on_warning"),
  ignore_case = FALSE
)

Arguments

x

[character]
Vector to construct sq object from.

alphabet

[character]
If provided value is a single string, it will be interpreted as type (see details). If provided value has length greater than one, it will be treated as atypical alphabet for sq object and sq type will be atp. If provided value is NULL, type guessing will be performed (see details).

NA_letter

[character(1)]
A string that is used to interpret and display NA value in the context of sq class. Default value equals to "!".

safe_mode

[logical(1)]
Default value is FALSE. When turned on, safe mode guarantees that NA appears within a sequence if and only if input sequence contains value passed with NA_letter. This means that resulting type might be different to the one passed as argument, if there are letters in a sequence that does not appear in the original alphabet.

on_warning

["silent" || "message" || "warning" || "error"]
Determines the method of handling warning message. Default value is "warning".

ignore_case

[logical(1)]
If turned on, lowercase letters are turned into respective uppercase ones and interpreted as such. If not, either sq object must be of type unt or all lowercase letters are interpreted as NA values. Default value is FALSE. Ignoring case does not work with atp alphabets.

Details

Function sq covers all possibilities of standard and non-standard types and alphabets. You can check what 'type' and 'alphabet' exactly are in sq class documentation. There is a guide below on how function operates and how the program behaves depending on arguments passed and letters in the sequences.

x parameter should be a character vector. Each element of this vector is a biological sequence. If this parameter has length 0, object of class sq with 0 sequences will be created (if not specified, it will have dna_bsc type, which is a result of rules written below). If it contains sequences of length 0, NULL sequences will be introduced (see NULL (empty) sequences section in sq class).

Important note: in all below cases word 'letter' stands for an element of an alphabet. Letter might consist of more than one character, for example "Ala" might be a single letter. However, if the user wants to construct or read sequences with multi-character letters, one has to specify all letters in alphabet parameter. Details of letters, alphabet and types can be found in sq class documentation.

Value

An object of class sq with appropriate type.

Simple guide to construct

In many cases, just the x parameter needs to be specified - type of sequences will be guessed according to rules described below. The user needs to pay attention, however, because for short sequences type may be guessed incorrectly - in this case they should specify type in alphabet parameter.

If your sequences contain non-standard letters, where each non-standard letter is one character long (that is, any character that is not an uppercase letter), you also don't need to specify any parameter. Optionally, you can explicitly do it by setting alphabet to "unt".

In safe mode it is guaranteed that only letters which are equal to NA_letter argument are interpreted as NA values. Due to that, resulting alphabet might be different from the alphabet argument.

Detailed guide to construct

Below are listed all possibilities that can occur during the construction of a sq object:

If ignore_case is set to TRUE, then lowercase letters are turned into uppercase during their interpretation, unless type is set to atp.

Handling unt and atp types and NA values

You can convert letters into another using substitute_letters and then use typify or sq_type<- function to set type of sq to dna_bsc, dna_ext, rna_bsc, rna_ext, ami_bsc or ami_ext. If your sequences contain NA values, use remove_na.

See Also

Functions from input module: import_sq(), random_sq(), read_fasta()

Examples

# constructing sq without specifying alphabet:
# Correct sq type will be guessed from appearing letters
## dna_bsc
sq(c("ATGC", "TCGTTA", "TT--AG"))

## rna_bsc
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"))

## ami_bsc
sq(c("YQQPAVVM", "PQCFL"))

## ami cln sq can contain "*" - a letter meaning end of translation:
sq(c("MMDF*", "SYIHR*", "MGG*"))

## dna_ext
sq(c("TMVCCDA", "BASDT-CNN"))

## rna_ext
sq(c("WHDHKYN", "GCYVCYU"))

## ami_ext
sq(c("XYOQWWKCNJLO"))

## unt - assume that one wants to mark some special element in sequence with "%"
sq(c("%%YAPLAA", "PLAA"))

# passing type as alphabet parameter:
# All above examples yield an identical result if type specified is the same as guessed
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_bsc")
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), "rna_bsc")
sq(c("YQQPAVVM", "PQCFL"), "ami_bsc")
sq(c("MMDF*", "SYIHR*", "MGG*"), "ami_bsc")
sq(c("TMVCCDA", "BASDT-CNN"), "dna_ext")
sq(c("WHDHKYN", "GCYVCYU"), "rna_ext")
sq(c("XYOQWWKCNJLO"), "ami_ext")
sq(c("%%YAPLAA", "PLAA"), "unt")

# Type doesn't have to be the same as the guessed one if letters fit in the destination alphabet
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_ext")
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_bsc")
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_ext")
sq(c("ATGC", "TCGTTA", "TT--AG"), "unt")

# constructing sq with specified letters of alphabet:
# In sequences below "mA" denotes methyled alanine - two characters are treated as single letter
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c("mA", LETTERS))
# Order of alphabet letters are not meaningful in most cases
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c(LETTERS, "mA"))

# reading sequences with three-letter names:
sq(c("ProProGlyAlaMetAlaCys"), alphabet = c("Pro", "Gly", "Ala", "Met", "Cys"))

# using safe mode:
# Safe mode guarantees that no element is read as NA
# But resulting alphabet might be different to the passed one (albeit with warning/error)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc", safe_mode = TRUE)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc")

# Safe mode guesses alphabet based on whole sequence
long_sequence <- paste0(paste0(rep("A", 4500), collapse = ""), "N")
sq(long_sequence, safe_mode = TRUE)
sq(long_sequence)

# ignoring case:
# By default, lower- and uppercase letters are treated separately
# This behavior can be changed by setting ignore_case = TRUE
sq(c("aTGc", "tcgTTA", "tt--AG"), ignore_case = TRUE)
sq(c("XYOqwwKCNJLo"), ignore_case = TRUE)

# It is possible to construct sq with length 0
sq(character())

# As well as sq with empty sequences
sq(c("AGTGGC", "", "CATGA", ""))


[Package tidysq version 1.1.3 Index]