RadixForest {seqtrie}    R Documentation

RadixForest

Description

Radix Forest class implementation

Details

The RadixForest class is a specialization of the RadixTree implementation. Instead of putting sequences into a single tree, the RadixForest class puts sequences into separate trees based on sequence length. This allows for faster searching of similar sequences based on Hamming or Levenshtein distance metrics. Unlike the RadixTree class, the RadixForest class does not support anchored searches or a custom cost matrix. See RadixTree for additional details.
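The benefit of splitting by length can be sketched in base R. This is only an illustration, not the package's implementation: plain character vectors stand in for the per-length trees, and `candidate_lengths` is a hypothetical helper. It relies on two general facts: Hamming distance is defined only between sequences of equal length, and the Levenshtein distance between two sequences is never smaller than the difference in their lengths.

```r
# Illustrative sketch only: bucket sequences by length, as a stand-in for the
# forest's map of sequence length -> tree. Character vectors replace the trees.
sequences <- c("ACGT", "AAAA", "ACG", "ACGTT", "ACGTACGT")
buckets <- split(sequences, nchar(sequences))

# Hypothetical helper: which sequence lengths can possibly contain a match?
candidate_lengths <- function(query, max_distance,
                              mode = c("levenshtein", "hamming")) {
  mode <- match.arg(mode)
  n <- nchar(query)
  if (mode == "hamming") {
    n  # Hamming distance requires equal lengths
  } else {
    # Levenshtein distance is at least the difference in lengths
    seq(max(0, n - max_distance), n + max_distance)
  }
}

# Only buckets with a compatible length need to be searched at all
lens <- candidate_lengths("ACG", max_distance = 1, mode = "levenshtein")
wanted <- intersect(as.character(lens), names(buckets))
candidates <- unlist(buckets[wanted], use.names = FALSE)
print(candidates)
```

Here the length-5 and length-8 buckets are skipped entirely for a Levenshtein search at distance 1, and a Hamming search would inspect the length-3 bucket only.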

Public fields

forest_pointer

Map of sequence length to RadixTree

char_counter_pointer

Character count data for the purpose of validating input

Methods

Public methods


Method new()

Create a new RadixForest object

Usage
RadixForest$new(sequences = NULL)
Arguments
sequences

A character vector of sequences to insert into the forest


Method show()

Print the forest to screen

Usage
RadixForest$show()

Method to_string()

Print the forest to a string

Usage
RadixForest$to_string()

Method graph()

Plot of the forest using igraph

Usage
RadixForest$graph(depth = -1, root_label = "root", plot = TRUE)
Arguments
depth

The tree depth to plot for each tree in the forest.

root_label

The label of the root node(s) in the plot.

plot

Whether to create a plot or return the data used to generate the plot.

Returns

A data frame of parent-child relationships used to generate the igraph plot OR a ggplot2 object


Method to_vector()

Output all sequences held by the forest as a character vector

Usage
RadixForest$to_vector()
Returns

A character vector of all sequences contained in the forest.


Method size()

Output the size of the forest (i.e. how many sequences are contained)

Usage
RadixForest$size()
Returns

The size of the forest


Method insert()

Insert new sequences into the forest

Usage
RadixForest$insert(sequences)
Arguments
sequences

A character vector of sequences to insert into the forest

Returns

A logical vector indicating whether each sequence was inserted (TRUE) or already existed in the forest (FALSE)


Method erase()

Erase sequences from the forest

Usage
RadixForest$erase(sequences)
Arguments
sequences

A character vector of sequences to erase from the forest

Returns

A logical vector indicating whether the sequence was erased (TRUE) or not found in the forest (FALSE)


Method find()

Find sequences in the forest

Usage
RadixForest$find(query)
Arguments
query

A character vector of sequences to find in the forest

Returns

A logical vector indicating whether the sequence was found (TRUE) or not found in the forest (FALSE)


Method prefix_search()

Search for sequences in the forest that start with a specified prefix. E.g.: a query of "CAR" will find "CART", "CARBON", "CARROT", etc. but not "CATS".

Usage
RadixForest$prefix_search(query)
Arguments
query

A character vector of sequences to search for in the forest

Returns

A data frame of all matches with columns "query" and "target".
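The prefix semantics can be reproduced with a brute-force base R reference, which may be useful for checking results on small inputs. This is a sketch under assumptions: `prefix_search_reference` is a hypothetical name, it scans a plain character vector rather than walking a trie, and the row order of the real method may differ.

```r
# Base R reference for the prefix semantics described above. It is not the
# package's trie-based implementation: it simply scans every target.
prefix_search_reference <- function(query, targets) {
  rows <- lapply(query, function(q) {
    hits <- targets[startsWith(targets, q)]
    # One output row per (query, matching target) pair
    data.frame(query = rep(q, length(hits)), target = hits,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

targets <- c("CART", "CARBON", "CARROT", "CATS")
res <- prefix_search_reference("CAR", targets)
print(res)
```

"CATS" is excluded because it diverges from the query at the third character.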


Method search()

Search for sequences in the forest that are within a specified distance of each query sequence, as measured by the chosen distance metric.

Usage
RadixForest$search(
  query,
  max_distance = NULL,
  max_fraction = NULL,
  mode = "levenshtein",
  nthreads = 1,
  show_progress = FALSE
)
Arguments
query

A character vector of query sequences.

max_distance

How far to search, in units of absolute distance. Can be a single value or a vector. Mutually exclusive with max_fraction.

max_fraction

How far to search, in units of distance relative to the length of each query sequence. Can be a single value or a vector. Mutually exclusive with max_distance.

mode

The distance metric to use. One of hamming (hm) or levenshtein (the default). Anchored searches are not supported by RadixForest (see Details).

nthreads

The number of threads to use for parallel computation.

show_progress

Whether to show a progress bar.

Returns

The output is a data.frame of all matches with columns "query", "target" and "distance".
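For small inputs, the two distance modes can be cross-checked against a brute-force base R reference. This is a minimal sketch, not how the forest searches internally: `hamming_distance` and `search_reference` are hypothetical names, `utils::adist` (unit-cost edit distance) supplies the Levenshtein distance, and only a single `max_distance` value is handled.

```r
# Brute-force reference; assumption: this is NOT seqtrie's algorithm.
hamming_distance <- function(a, b) {
  if (nchar(a) != nchar(b)) return(NA_integer_)  # undefined for unequal lengths
  sum(strsplit(a, "")[[1]] != strsplit(b, "")[[1]])
}

search_reference <- function(query, targets, max_distance,
                             mode = c("levenshtein", "hamming")) {
  mode <- match.arg(mode)
  rows <- lapply(query, function(q) {
    d <- if (mode == "levenshtein") {
      as.integer(utils::adist(q, targets)[1, ])  # base R edit distance
    } else {
      vapply(targets, function(t) hamming_distance(q, t),
             integer(1), USE.NAMES = FALSE)
    }
    # Keep targets whose distance is defined and within the threshold
    keep <- !is.na(d) & d <= max_distance
    data.frame(query = rep(q, sum(keep)), target = targets[keep],
               distance = d[keep], stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

lv <- search_reference("ACG", "ACGT", max_distance = 1, mode = "levenshtein")
hm <- search_reference("ACG", "ACGT", max_distance = 1, mode = "hamming")
print(lv)
print(hm)
```

The Levenshtein search reports "ACGT" at distance 1 (one insertion), while the Hamming search returns no rows because the query and target differ in length.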


Method validate()

Validate the forest

Usage
RadixForest$validate()
Returns

A logical indicating whether the forest is valid (TRUE) or not (FALSE). This is mostly an internal function for debugging purposes and should always return TRUE.

Examples

forest <- RadixForest$new()
forest$insert(c("ACGT", "AAAA"))
forest$erase("AAAA")
forest$search("ACG", max_distance = 1, mode = "levenshtein")
#   query target distance
# 1   ACG   ACGT        1

forest$search("ACG", max_distance = 1, mode = "hamming")
# query    target   distance
# <0 rows> (or 0-length row.names)

[Package seqtrie version 0.2.8 Index]