R: Fast searching for one or more keywords in a list of texts

AhoCorasickSearchList {AhoCorasickTrie}

R Documentation

Fast searching for one or more keywords in a list of texts

Description

Builds an Aho-Corasick trie from one or more keywords and uses it to search a list of one or more texts. For a large number of keywords, Aho-Corasick is much faster than a naive approach (such as lapply(keywords, gregexpr, text)).

Use AhoCorasickSearchList instead of AhoCorasickSearch when you want to keep the matches of each input sublist separate. If the sublists of the input list have names, the resulting list of lists will use those names, but sublists with no matches will still be in the resulting list. If the texts of the sublists have names, the resulting sublists of matches will use those names, and the texts with no matches will be dropped. If the input texts do not have names, then the resulting sublists of matches will be in the same order as the input texts, and non-matched texts will be kept to preserve that order. Thus, it is more efficient to use named input texts (so non-matched texts can be dropped).

The default alphabet allows all 128 ASCII characters in the keywords and the texts. Characters outside this range will cause an error. A more efficient trie is possible if the alphabet size can be reduced. For example, DNA sequences use at most 19 distinct characters and usually only 4; protein sequences use at most 26 distinct characters and usually only 20. Set the alphabet parameter if a reduced alphabet is appropriate.

UTF-8 (Unicode) matching is not currently supported.

Usage

AhoCorasickSearchList(
  keywords,
  textList,
  alphabet = "ascii",
  groupByKeyword = FALSE,
  iterationFeedback = 0L
)

Arguments

`keywords`	Character vector of one or more keywords
`textList`	List of lists, each sublist with one or more texts to search
`alphabet`	Alphabet to use; one of `ascii`, `aminoacid`, or `nucleicacid`
`groupByKeyword`	If true, matches are grouped by keyword (instead of by text)
`iterationFeedback`	When set to a positive integer `i`, console output will indicate when searching every `i`th text

Value

List of lists of matches, grouped by either text or by keyword (each list of texts gets its own list of matches)

Examples

listEquals = function(a, b) { is.null(unlist(a)) && is.null(unlist(b)) ||
                              !is.null(a) && !is.null(b) && all(unlist(a) == unlist(b)) }
keywords = c("Abra", "cadabra", "is", "the", "Magic", "Word")

# 1. Search a list of lists without names
# * sublists are accessed by index
# * texts are accessed by index
# * non-matched texts are kept (input index order is preserved)
listSearch = AhoCorasickSearchList(keywords,
                                   list(c("What in", "the world"),
                                        c("is"),
                                        "secret about",
                                        "the Magic Word?"))
stopifnot(listEquals(listSearch[[1]][[1]], list()))
stopifnot(listEquals(listSearch[[1]][[2]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[2]][[1]][[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(listSearch[[3]], list()))
stopifnot(listEquals(listSearch[[4]][[1]][[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(listSearch[[4]][[1]][[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(listSearch[[4]][[1]][[3]], list(keyword="Word", offset=11)))

# 2. Search a named list of named lists
# * sublists are accessed by name
# * matched texts are accessed by name
# * non-matched texts are dropped
namedSearch = AhoCorasickSearchList(keywords,
                                    list(subject=c(phrase1="What in", phrase2="the world"),
                                         verb=c(phrase1="is"),
                                         predicate1=c(phrase1="secret about"),
                                         predicate2=c(phrase1="the Magic Word?")))
stopifnot(listEquals(namedSearch$subject$phrase2[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$verb$phrase1[[1]], list(keyword="is", offset=1)))
stopifnot(listEquals(namedSearch$predicate1, list()))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[1]], list(keyword="the", offset=1)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[2]], list(keyword="Magic", offset=5)))
stopifnot(listEquals(namedSearch$predicate2$phrase1[[3]], list(keyword="Word", offset=11)))