R: Distance search for similar sequences

dist_search {seqtrie}

R Documentation

Distance search for similar sequences

Description

Find similar sequences within a distance threshold

Usage

dist_search(
  query,
  target,
  max_distance = NULL,
  max_fraction = NULL,
  mode = "levenshtein",
  cost_matrix = NULL,
  gap_cost = NULL,
  gap_open_cost = NULL,
  tree_class = "RadixTree",
  nthreads = 1,
  show_progress = FALSE
)

Arguments

`query`	A character vector of query sequences.
`target`	A character vector of target sequences.
`max_distance`	How far to search, in units of absolute distance. Can be a single value or a vector. Mutually exclusive with `max_fraction`.
`max_fraction`	How far to search, in units of distance relative to the length of each query sequence. Can be a single value or a vector. Mutually exclusive with `max_distance`.
`mode`	The distance metric to use. One of "hamming" (hm), "global" (gb) or "anchored" (an). The default, "levenshtein", is global alignment with unit costs (see Details).
`cost_matrix`	A custom cost matrix for use with the "global" or "anchored" distance metrics. See details.
`gap_cost`	The cost of a gap for use with the "global" or "anchored" distance metrics. See details.
`gap_open_cost`	The cost of a gap opening. See details.
`tree_class`	Which R6 class to use to store the target sequences. Either RadixTree or RadixForest (default: RadixTree).
`nthreads`	The number of threads to use for parallel computation.
`show_progress`	Whether to show a progress bar.
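To make the difference between `max_distance` and `max_fraction` concrete, the sketch below converts a relative threshold into the absolute threshold it implies for each query. The rounding rule (rounding down) is an assumption made for illustration, not something this page states. The `dist_search` call is guarded so the snippet still runs when seqtrie is not installed.

```r
query <- c("ACGT", "ACGTACGTAC")
max_fraction <- 0.25

# Assumption: the per-query threshold is max_fraction times the query length,
# rounded down. The package may round differently.
implied_max_distance <- floor(nchar(query) * max_fraction)
print(implied_max_distance)

# Only run the actual search if seqtrie is available.
if (requireNamespace("seqtrie", quietly = TRUE)) {
  target <- c("ACGA", "ACGTACGTTT", "TTTT")
  res <- seqtrie::dist_search(query, target,
                              max_fraction = max_fraction,
                              mode = "levenshtein")
  print(res)
}
```

Longer queries get a larger absolute budget under `max_fraction`, whereas a single `max_distance` value applies the same budget to every query.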

Details

This function finds all sequences in target that are within a distance threshold of any sequence in query. The target sequences are stored in either a RadixTree or a RadixForest, as selected by tree_class. See the R6 class documentation for additional details.

Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.

An anchored alignment is a form of semi-global alignment, where the alignment is "anchored" (global) at the beginning of both the query and target sequences, but is semi-global in that the end of either the query sequence or the target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in the literature.
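One way to read the anchored definition is as a standard edit-distance table in which the best score is taken over the last row and last column, so that a trailing part of one sequence can be left unaligned. The base R function below is a sketch of that interpretation with unit costs. `anchored_distance` is a hypothetical helper written for illustration, not part of seqtrie, and it is not the trie-based search the package performs.

```r
# Hypothetical helper: anchored distance with unit mismatch and indel costs.
anchored_distance <- function(query, target) {
  q <- strsplit(query, "")[[1]]
  tg <- strsplit(target, "")[[1]]
  n <- length(q)
  m <- length(tg)
  # D[i + 1, j + 1] = edit distance between the first i query characters
  # and the first j target characters (both anchored at the start).
  D <- matrix(0L, n + 1, m + 1)
  D[, 1] <- 0:n
  D[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      D[i + 1, j + 1] <- min(
        D[i, j] + (q[i] != tg[j]),  # match or mismatch
        D[i, j + 1] + 1L,           # gap in target
        D[i + 1, j] + 1L            # gap in query
      )
    }
  }
  # Last row: whole query aligned, target end may be unaligned.
  # Last column: whole target aligned, query end may be unaligned.
  min(D[n + 1, ], D[, m + 1])
}

anchored_distance("ACGT", "ACGTTTTT")  # query matches a prefix of the target
anchored_distance("ACGTAA", "ACCT")    # target aligns to the query prefix "ACGT"
```

Under this reading, a query that is an exact prefix of a target has anchored distance 0, even though its global distance to that target is the length difference.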

In contrast, a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
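For small inputs, the unit-cost global case can be cross-checked by brute force with base R's `utils::adist`, which computes the Levenshtein distance between every pair of strings. The sketch below does this for the sequences used in the Examples section. It illustrates what "within a distance threshold" means and is not how seqtrie performs the search. The `brute` data frame and its `distance` column are names chosen here for illustration.

```r
query <- c("ACGT", "AAAA")
target <- c("ACG", "ACGT")

# Pairwise Levenshtein distances: rows are queries, columns are targets.
d <- utils::adist(query, target)
dimnames(d) <- list(query, target)
print(d)

# Keep every (query, target) pair within an absolute distance of 1.
hits <- which(d <= 1, arr.ind = TRUE)
brute <- data.frame(
  query = query[hits[, "row"]],
  target = target[hits[, "col"]],
  distance = d[hits]
)
print(brute)
```

The brute-force table grows with the product of the two input sizes, so it is only practical as a sanity check on small inputs.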

By default, if mode is "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column and row names. To set the cost of a gap (insertion or deletion), either include a row and column named "gap" in the cost_matrix or set the gap_cost parameter (a single positive integer). Similarly, affine gap alignment can be enabled by either including a row and column named "gap_open" in the cost_matrix or setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
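The affine gap formula can be evaluated directly to see why it favors one long gap over several short ones. The sketch below is plain arithmetic on the formula above. `total_gap_cost` is a hypothetical helper and the cost values are arbitrary illustrations.

```r
# Hypothetical helper implementing
# TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
total_gap_cost <- function(gap_length, gap_cost, gap_open_cost = 0) {
  gap_open_cost + gap_cost * gap_length
}

# One contiguous gap of length 3 pays the opening cost once.
one_long_gap <- total_gap_cost(3, gap_cost = 1, gap_open_cost = 5)

# Three separate gaps of length 1 pay the opening cost three times.
three_short_gaps <- 3 * total_gap_cost(1, gap_cost = 1, gap_open_cost = 5)

print(c(one_long_gap = one_long_gap, three_short_gaps = three_short_gaps))
```

With an opening cost of 0 the formula reduces to the linear case, where the cost depends only on the total number of gap positions.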

If mode == "hamming", all alignment parameters are ignored; each mismatch is given a distance of 1 and gaps are not allowed.
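As a point of reference, the Hamming distance counts the positions at which two sequences differ. The base R sketch below assumes that, because gaps are not allowed, only sequences of equal length can be compared. `hamming_distance` is a hypothetical helper written for illustration, not a seqtrie function.

```r
# Hypothetical helper: number of mismatching positions between two strings.
hamming_distance <- function(a, b) {
  # Assumption: without gaps, the two sequences must have the same length.
  stopifnot(nchar(a) == nchar(b))
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

hamming_distance("ACGT", "ACGA")  # differs only at the last position
hamming_distance("AAAA", "ACGT")  # differs at positions 2, 3 and 4
```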

Value

The output is a data.frame of all matches with columns "query" and "target". For anchored searches, the output also includes attributes "query_size" and "target_size" which are vectors containing the portion of the query and target sequences that are aligned.

Examples

dist_search(c("ACGT", "AAAA"), c("ACG", "ACGT"), max_distance = 1, mode = "levenshtein")

[Package seqtrie version 0.2.8 Index]