RadixTree {seqtrie} | R Documentation |
RadixTree
Description
Radix Tree (trie) class implementation
Details
The RadixTree class is a trie implementation. The primary usage is to be able to search of similar sequences based on a dynamic programming framework. This can be done using the search method which searches for similar sequences based on the Global, Anchored or Hamming distance metrics.
Three types of distance metrics are supported, based on the form of alignment performed. These are: Hamming, Global (Levenshtein) and Anchored.
An anchored alignment is a form of semi-global alignment, where the query sequence is "anchored" (global) to the beginning of both the query and target sequences, but is semi-global in that the end of the either the query sequence or target sequence (but not both) can be unaligned. This type of alignment is sometimes called an "extension" alignment in literature.
In contrast a global alignment must align the entire query and target sequences. When mismatch and indel costs are equal to 1, this is also known as the Levenshtein distance.
By default, if mode == "global" or "anchored", all mismatches and indels are given a cost of 1. However, you can define your own distance metric by setting the cost_matrix and gap parameters. The cost_matrix is a strictly positive square integer matrix and should include all characters in query and target as column- and rownames. To set the cost of a gap (insertion or deletion) you can include a row and column named "gap" in the cost_matrix OR set the gap_cost parameter (a single positive integer). Similarly, the affine gap alignment can be set by including a row and column named "gap_open" in the cost_matrix OR setting the gap_open_cost parameter (a single positive integer). If affine alignment is used, the cost of a gap is defined as: TOTAL_GAP_COST = gap_open_cost + (gap_cost * gap_length).
If mode == "hamming" all alignment parameters are ignored; mismatch is given a distance of 1 and gaps are not allowed.
Public fields
root_pointer
Root of the RadixTree
char_counter_pointer
Character count data for the purpose of validating input
Methods
Public methods
Method new()
Create a new RadixTree object
Usage
RadixTree$new(sequences = NULL)
Arguments
sequences
A character vector of sequences to insert into the tree
Method show()
Print the tree to screen
Usage
RadixTree$show()
Method to_string()
Print the tree to a string
Usage
RadixTree$to_string()
Returns
A string representation of the tree
Method graph()
Plot of the tree using igraph (needs to be installed separately)
Usage
RadixTree$graph(depth = -1, root_label = "root", plot = TRUE)
Arguments
depth
The tree depth to plot. If -1 (default), plot the entire tree.
root_label
The label of the root node in the plot.
plot
Whether to create a plot or return the data used to generate the plot.
Returns
A data frame of parent-child relationships used to generate the igraph plot OR a ggplot2 object
Method to_vector()
Output all sequences held by the tree as a character vector
Usage
RadixTree$to_vector()
Returns
A character vector of all sequences contained in the tree. Return order is not guaranteed.
Method size()
Output the size of the tree (i.e. how many sequences are contained)
Usage
RadixTree$size()
Returns
The size of the tree
Method insert()
Insert new sequences into the tree
Usage
RadixTree$insert(sequences)
Arguments
sequences
A character vector of sequences to insert into the tree
Returns
A logical vector indicating whether the sequence was inserted (TRUE) or already existing in the tree (FALSE)
Method erase()
Erase sequences from the tree
Usage
RadixTree$erase(sequences)
Arguments
sequences
A character vector of sequences to erase from the tree
Returns
A logical vector indicating whether the sequence was erased (TRUE) or not found in the tree (FALSE)
Method find()
Find sequences in the tree
Usage
RadixTree$find(query)
Arguments
query
A character vector of sequences to find in the tree
Returns
A logical vector indicating whether the sequence was found (TRUE) or not found in the tree (FALSE)
Method prefix_search()
Search for sequences in the tree that start with a specified prefix. E.g.: a query of "CAR" will find "CART", "CARBON", "CARROT", etc. but not "CATS".
Usage
RadixTree$prefix_search(query)
Arguments
query
A character vector of sequences to search for in the tree
Returns
A data frame of all matches with columns "query" and "target".
Method search()
Search for sequences in the tree that are with a specified distance metric to a specified query.
Usage
RadixTree$search( query, max_distance = NULL, max_fraction = NULL, mode = "levenshtein", cost_matrix = NULL, gap_cost = NULL, gap_open_cost = NULL, nthreads = 1, show_progress = FALSE )
Arguments
query
A character vector of query sequences.
max_distance
how far to search in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction.
max_fraction
how far to search in units of relative distance to each query sequence length. Can be a single value or a vector. Mutually exclusive with max_distance.
mode
The distance metric to use. One of hamming (hm), global (gb) or anchored (an).
cost_matrix
A custom cost matrix for use with the "global" or "anchored" distance metrics. See details.
gap_cost
The cost of a gap for use with the "global" or "anchored" distance metrics. See details.
gap_open_cost
The cost of a gap opening. See details.
nthreads
The number of threads to use for parallel computation.
show_progress
Whether to show a progress bar.
Returns
The output is a data.frame of all matches with columns "query" and "target". For anchored searches, the output also includes attributes "query_size" and "target_size" which are vectors containing the portion of the query and target sequences that are aligned.
Method validate()
Validate the tree
Usage
RadixTree$validate()
Returns
A logical indicating whether the tree is valid (TRUE) or not (FALSE). This is mostly an internal function for debugging purposes and should always return TRUE.
See Also
https://en.wikipedia.org/wiki/Radix_tree
Examples
tree <- RadixTree$new()
tree$insert(c("ACGT", "AAAA"))
tree$erase("AAAA")
tree$search("ACG", max_distance = 1, mode = "levenshtein")
# query target distance
# 1 ACG ACGT 1
tree$search("ACG", max_distance = 1, mode = "hamming")
# query target distance
# <0 rows> (or 0-length row.names)