
read_tracks {gggenomes}

R Documentation

Read files in various standard formats (FASTA, GFF3, GBK, BED, BLAST, ...) into track tables

Description

Convenience functions to read sequences, features or links from various bioinformatics file formats, such as FASTA, GFF3, GenBank, BLAST tabular output, etc. See def_formats() for the full list. The file format and the corresponding read function are determined automatically from the file extension. All these functions can read multiple files of the same format at once and combine them into a single table. This is useful, for example, to read a folder of GFF files in which each file contains the genes of a different genome.
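
The two ideas in this description, choosing a format from the file extension and stacking several files into one table with a source column, can be illustrated with a minimal base R sketch. This is not the gggenomes implementation: `ext_to_format`, `guess_format()` and `read_many()` are hypothetical names, the extension table is a made-up subset (the real list comes from def_formats()), and deriving the ID from the file name without its extension is an assumption.

```r
# Hypothetical extension-to-format lookup (illustration only).
ext_to_format <- c(gff = "gff3", gff3 = "gff3", gbk = "gbk", fna = "fasta", bed = "bed")

# Guess a format from a file name, ignoring a trailing compression suffix.
guess_format <- function(file) {
  base <- sub("\\.(gz|bz2|xz|zip)$", "", basename(file))
  ext <- tolower(sub(".*\\.", "", base))
  unname(ext_to_format[ext])  # NA if the extension is unknown
}

# Read several tab-separated files and stack them, recording the source
# of every row in the column named by `.id`.
read_many <- function(files, .id = "file_id") {
  ids <- sub("\\..*$", "", basename(files))  # assumed: file name minus extension
  tables <- Map(function(f, id) {
    tab <- read.delim(f, header = TRUE, stringsAsFactors = FALSE)
    tab[[.id]] <- id
    tab
  }, files, ids)
  do.call(rbind, unname(tables))
}

# Two toy tables standing in for per-genome feature files.
tmp_dir <- tempfile("tracks")
dir.create(tmp_dir)
f1 <- file.path(tmp_dir, "genomeA.tsv")
f2 <- file.path(tmp_dir, "genomeB.tsv")
write.table(data.frame(seq_id = "a1", start = 1, end = 90), f1,
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data.frame(seq_id = "b1", start = 5, end = 50), f2,
            sep = "\t", quote = FALSE, row.names = FALSE)

combined <- read_many(c(f1, f2))
print(combined)
print(guess_format("annotation.gff.gz"))
```

Passing `.id = "bin_id"` to the sketch would simply rename the source column, which mirrors the purpose of the `.id` argument documented below.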

Usage

read_feats(files, .id = "file_id", format = NULL, parser = NULL, ...)

read_subfeats(files, .id = "file_id", format = NULL, parser = NULL, ...)

read_links(files, .id = "file_id", format = NULL, parser = NULL, ...)

read_sublinks(files, .id = "file_id", format = NULL, parser = NULL, ...)

read_seqs(
  files,
  .id = "file_id",
  format = NULL,
  parser = NULL,
  parse_desc = TRUE,
  ...
)

Arguments

`files`	files to read. All files should be of the same format. In many cases, compressed files (`.gz`, `.bz2`, `.xz`, or `.zip`) are supported. Similarly, automatic download of remote files starting with `http(s)://` or `ftp(s)://` works in most cases.
`.id`	the column with the name of the file a record was read from. Defaults to "file_id". Set to "bin_id" if every file represents a different bin.
`format`	specify a format known to gggenomes, such as `gff3`, `gbk`, ... to override the automatic determination based on the file extension (see `def_formats()` for the full list).
`parser`	specify the name of an R function to override the automatic determination based on the format, e.g. `parser="read_tsv"`.
`...`	additional arguments passed on to the format-specific read function called down the line.
`parse_desc`	turn `key=some value` pairs from `seq_desc` into `key`-named columns and remove them from `seq_desc`.
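
To make the effect of `parse_desc` concrete, here is a minimal base R sketch of splitting `key=some value` pairs out of a description string. It is an illustration under stated assumptions, not the parser gggenomes uses: `parse_desc_sketch()` is a hypothetical helper, and the rule that a value runs until the next `key=` or the end of the string is an assumption made so that values may contain spaces.

```r
# Hypothetical helper: split "key=some value" pairs out of a description.
parse_desc_sketch <- function(desc) {
  # A key is a run of word characters followed by "="; the value extends
  # (lazily) up to the next " key=" or the end of the string.
  pattern <- "(\\w+)=(.*?)(?=\\s+\\w+=|$)"
  hits <- regmatches(desc, gregexpr(pattern, desc, perl = TRUE))[[1]]
  keys <- sub("=.*$", "", hits)
  values <- sub("^[^=]*=", "", hits)
  # Whatever is left after removing the pairs stays in the description.
  rest <- trimws(gsub(pattern, "", desc, perl = TRUE))
  list(seq_desc = rest, fields = setNames(as.list(values), keys))
}

res <- parse_desc_sketch("Phage isolate host=E. coli length=1200")
str(res)
```

In a table, each entry of `fields` would become its own column named after the key, while `seq_desc` keeps only the free text.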

Value

A gggenomes-compatible sequence, feature or link tibble:

read_feats(), read_subfeats(): tibble with features
read_links(), read_sublinks(): tibble with links
read_seqs(): tibble with sequence information

Functions

read_feats(): read files as features mapping onto sequences.
read_subfeats(): read files as subfeatures mapping onto other features.
read_links(): read files as links connecting sequences.
read_sublinks(): read files as sublinks connecting features.
read_seqs(): read sequence ID, description and length.
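
The relationships named above (features map onto sequences, links connect sequences) can be pictured with toy tables. This is only a sketch in base R: the column names `seq_id`, `feat_id`, `start`, `end`, `seq_id2`, `start2` and `end2` are assumptions chosen for illustration and are not taken from this help page, and the data are invented.

```r
# Toy sequence table: one row per sequence.
seqs <- data.frame(seq_id = c("A", "B"), length = c(1000, 800))

# Toy feature table: each feature points at a sequence via seq_id.
feats <- data.frame(
  seq_id  = c("A", "A", "B"),
  feat_id = c("g1", "g2", "g3"),
  start   = c(10, 300, 50),
  end     = c(200, 700, 400)
)

# Toy link table: each link connects a region on one sequence (seq_id)
# to a region on another sequence (seq_id2).
links <- data.frame(
  seq_id = "A", start = 10, end = 200,
  seq_id2 = "B", start2 = 50, end2 = 400
)

# Consistency checks: every feature and link refers to a known sequence.
feats_ok <- all(feats$seq_id %in% seqs$seq_id)
links_ok <- all(links$seq_id %in% seqs$seq_id) &&
  all(links$seq_id2 %in% seqs$seq_id)
print(c(feats_ok = feats_ok, links_ok = links_ok))
```

Subfeatures and sublinks follow the same idea one level down, referring to features rather than sequences.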

Examples

# read genes/features from a gff file
read_feats(ex("eden-utr.gff"))


# read all gff files from a directory (`pattern` is a regular expression, not a glob)
read_feats(list.files(ex("emales/"), "\\.gff$", full.names = TRUE))


# read remote (and compressed) gff files
gff_phages <- c(
  PSSP7 = paste0(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/",
    "000/858/745/GCF_000858745.1_ViralProj15134/",
    "GCF_000858745.1_ViralProj15134_genomic.gff.gz"
  ),
  PSSP3 = paste0(
    "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/",
    "000/904/555/GCF_000904555.1_ViralProj195517/",
    "GCF_000904555.1_ViralProj195517_genomic.gff.gz"
  )
)
read_feats(gff_phages)


# read sequences from a fasta file.
read_seqs(ex("emales/emales.fna"), parse_desc = FALSE)

# read sequence info from a fasta file with `parse_desc=TRUE` (default). `key=value`
# pairs are removed from `seq_desc` and parsed into columns with `key` as name
read_seqs(ex("emales/emales.fna"))

# read sequence info from samtools/seqkit style index
read_seqs(ex("emales/emales.fna.seqkit.fai"))

# read sequence info from multiple gff files
read_seqs(c(ex("emales/emales.gff"), ex("emales/emales-tirs.gff")))

[Package gggenomes version 1.0.0 Index]