read_paf {gggenomes} | R Documentation |
Read a .paf file (minimap/minimap2).
Description
Read a minimap/minimap2 .paf file including optional tagged extra fields. The optional fields will be parsed into a tidy format, one column per tag.
Usage
read_paf(
file,
max_tags = 20,
col_names = def_names("paf"),
col_types = def_types("paf"),
...
)
Arguments
file |
Either a path to a file, a connection, or literal data (either a single string or a raw vector). Files ending in Literal data is most useful for examples and tests. To be recognised as
literal data, the input must be either wrapped with Using a value of |
max_tags |
maximum number of optional fields to include |
col_names |
column names to use. Defaults to |
col_types |
column types to use. Defaults to |
... |
additional parameters, passed to |
Details
Because readr::read_tsv
expects a fixed number of columns, but in .paf the
number of optional fields can differ among records, read_paf
tries to read
at least as many columns as the longest record has (max_tags
). The
resulting warnings for each record with fewer fields of the form "32 columns
expected, only 22 seen" should thus be ignored.
From the minimap2 manual
+—-+——–+———————————————————+ |Col | Type | Description | +—-+——–+———————————————————+ | 1 | string | Query sequence name | | 2 | int | Query sequence length | | 3 | int | Query start coordinate (0-based) | | 4 | int | Query end coordinate (0-based) | | 5 | char | ‘+’ if query/target on the same strand; ‘-’ if opposite | | 6 | string | Target sequence name | | 7 | int | Target sequence length | | 8 | int | Target start coordinate on the original strand | | 9 | int | Target end coordinate on the original strand | | 10 | int | Number of matching bases in the mapping | | 11 | int | Number bases, including gaps, in the mapping | | 12 | int | Mapping quality (0-255 with 255 for missing) | +—-+——–+———————————————————+
+—-+——+——————————————————-+ |Tag | Type | Description | +—-+——+——————————————————-+ | tp | A | Type of aln: P/primary, S/secondary and I,i/inversion | | cm | i | Number of minimizers on the chain | | s1 | i | Chaining score | | s2 | i | Chaining score of the best secondary chain | | NM | i | Total number of mismatches and gaps in the alignment | | MD | Z | To generate the ref sequence in the alignment | | AS | i | DP alignment score | | ms | i | DP score of the max scoring segment in the alignment | | nn | i | Number of ambiguous bases in the alignment | | ts | A | Transcript strand (splice mode only) | | cg | Z | CIGAR string (only in PAF) | | cs | Z | Difference string | | dv | f | Approximate per-base sequence divergence | +—-+——+——————————————————-+
From https://samtools.github.io/hts-specs/SAMtags.pdf type may be one of A (character), B (general array), f (real number), H (hexadecimal array), i (integer), or Z (string).
Value
tibble