dna2vector {astsa}

R Documentation

Convert DNA Sequence to Indicator Vectors

Description

Takes a string (e.g., a DNA sequence) of general form (e.g., FASTA) and converts it to a sequence of indicator vectors for use with the Spectral Envelope (specenv).

Usage

dna2vector(data, alphabet = NULL)

Arguments

`data`	A single string.
`alphabet`	The particular alphabet being used. If `NULL` (the default), `alphabet=c("A", "C", "G", "T")` is used.

Details

Takes a string of categories and converts it to a matrix of indicators. The data can then be used by the script specenv, which calculates the Spectral Envelope of the sequence (or subsequence). Many different types of sequences can be used, including FASTA and GenBank, as long as the data form a single string of categories.
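
To make the idea of an indicator matrix concrete, here is a minimal base R sketch. The helper `to_indicators` is hypothetical and is not part of astsa; it also assumes that characters outside the alphabet (spaces, digits, line breaks) are simply dropped, which may differ from what dna2vector actually does internally.

```r
# Hypothetical helper for illustration only; NOT the astsa implementation.
to_indicators <- function(s, alphabet = c("A", "C", "G", "T")) {
  chars <- strsplit(s, "")[[1]]           # split the string into single characters
  chars <- chars[chars %in% alphabet]     # assumption: drop anything not in the alphabet
  ind <- outer(chars, alphabet, "==") * 1L  # one row per position, one column per letter
  colnames(ind) <- alphabet
  ind
}

m <- to_indicators("ACGT TA")
m   # each row has a single 1 marking the letter at that position
```

Each row of the result corresponds to one position in the sequence, and each column to one letter of the alphabet.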

The indicator vectors are returned invisibly (as a matrix), so that a call whose result is not assigned to an object does not scroll the entire sequence across the screen. In practice, the user should assign the result, for example xdata = dna2vector(data), where data is the original sequence.
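
The effect of an invisible return can be seen with a small stand-in function. The name `make_matrix` is made up for this sketch; it only mimics the return behavior described above and does no sequence conversion.

```r
# Hypothetical stand-in that returns a matrix invisibly.
make_matrix <- function() invisible(diag(2))

make_matrix()                       # nothing is printed at top level
xdata <- make_matrix()              # the value is still available when assigned
res   <- withVisible(make_matrix()) # inspect the visibility flag directly
res$visible
```

Wrapping a call in parentheses, as in `(make_matrix())`, forces the value to print if that is ever wanted.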

As an example, if the DNA sequence is in a FASTA file, say sequence.fasta, remove the first line, which will look like >V01555.2 ... . Then the following code can be used to read the data into the session, create the indicator sequence and save it as a compressed R data file:

  fileName <- 'sequence.fasta'      # name of FASTA file
  data     <- readChar(fileName, file.info(fileName)$size)  # input the sequence
  myseq    <- dna2vector(data)      # convert it to indicators

  ##== to compress and save the data ==##
  save(myseq, file='myseq.rda')
  ##== and then load it when needed ==##
  load('myseq.rda')
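
As an alternative to deleting the header line by hand, the header can be dropped while reading the file. This is a sketch only: the file contents and header text below are invented so the code is self-contained, and the final call to dna2vector is left commented out because it requires astsa to be installed.

```r
# Write a tiny made-up FASTA file to a temporary location (illustration only).
fileName <- tempfile(fileext = ".fasta")
writeLines(c(">example header (made up for illustration)",
             "AGAATTCGTC",
             "TTGCTCTATT"), fileName)

lines <- readLines(fileName)
lines <- lines[!startsWith(lines, ">")]  # drop FASTA header line(s)
data  <- paste(lines, collapse = "")     # collapse to a single string

# With astsa installed, the conversion would then be:
# myseq <- astsa::dna2vector(data)
```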

Value

matrix of indicator vectors; returned invisibly

Author(s)

D.S. Stoffer

References

You can find demonstrations of astsa capabilities at FUN WITH ASTSA.

The most recent version of the package can be found at https://github.com/nickpoison/astsa/.

In addition, the News and ChangeLog files are at https://github.com/nickpoison/astsa/blob/master/NEWS.md.

The webpages for the texts and some help on using R for time series analysis can be found at https://nickpoison.github.io/.

Examples

# Epstein-Barr virus (entire sequence included in astsa)
xdata = dna2vector(EBV)
head(xdata)

# part of EBV with  1, 2, 3, 4 for "A", "C", "G", "T"
xdata = dna2vector(bnrf1ebv)
head(xdata)

# raw GenBank sequence
data <-
c("1 agaattcgtc ttgctctatt cacccttact tttcttcttg cccgttctct ttcttagtat
  61 gaatccagta tgcctgcctg taattgttgc gccctacctc ttttggctgg cggctattgc")
xdata = dna2vector(data, alphabet=c('a', 'c', 'g', 't'))
head(xdata)

# raw FASTA sequence
data <-
 c("AGAATTCGTCTTGCTCTATTCACCCTTACTTTTCTTCTTGCCCGTTCTCTTTCTTAGTATGAATCCAGTA
    TGCCTGCCTGTAATTGTTGCGCCCTACCTCTTTTGGCTGGCGGCTATTGCCGCCTCGTGTTTCACGGCCT")
xdata = dna2vector(data)
head(xdata)

[Package astsa version 2.1 Index]