search_dictionary {corpustools}        R Documentation

Dictionary lookup

Description

Similar to search_features, but for fast matching of large dictionaries.

Usage

search_dictionary(
  tc,
  dict,
  token_col = "token",
  string_col = "string",
  code_col = "code",
  sep = " ",
  mode = c("unique_hits", "features"),
  case_sensitive = F,
  use_wildcards = T,
  ascii = F,
  verbose = F
)

Arguments

tc

A tCorpus

dict

A dictionary. Can be either a data.frame or a quanteda dictionary. If a data.frame is given, it has to have a column named "string" (or use the string_col argument) that contains the dictionary terms, and a column named "code" (or use the code_col argument) that contains the label/code represented by this string. Each row has a single string, which can be a single word or a sequence of words separated by whitespace (e.g., "not bad"), and can contain the common ? and * wildcards. If a quanteda dictionary is given, it is automatically converted to this type of data.frame with the melt_quanteda_dict function. This can also be done manually for more control over labels.
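
For example (a minimal sketch; the strings and codes are only illustrations, and the quanteda part assumes the quanteda package is installed):

dict = data.frame(string = c('presid*', 'not bad', 'Barack Obama'),
                  code   = c('president', 'sentiment', 'person'))

## a quanteda dictionary, melted to the same data.frame format
qd = quanteda::dictionary(list(person = 'Barack Obama', sentiment = 'not bad'))
melt_quanteda_dict(qd)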

token_col

The feature in tc that contains the token text.

string_col

If dict is a data.frame, the name of the column in dict with the dictionary lookup string. Default is "string".

code_col

The name of the column in dict with the dictionary code/label. Default is "code". If dict is a quanteda dictionary with multiple levels, "code_l2", "code_l3", etc. can be used to select levels.

sep

A regular expression for separating multi-word lookup strings (default is " ", which is what quanteda dictionaries use). For example, if the dictionary contains "Barack Obama", sep should be " " so that it matches the consecutive tokens "Barack" and "Obama". Some dictionaries, however, use "Barack+Obama", in which case sep = '\\+' should be used.
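
For example, a small sketch of the latter case:

dict = data.frame(string = 'Barack+Obama', code = 'person')
tc = create_tcorpus('Barack Obama was the president')
search_dictionary(tc, dict, sep = '\\+')$hits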

mode

There are two modes: "unique_hits" and "features". The "unique_hits" mode prioritizes finding unique matches, which is recommended for counting how often a dictionary term occurs. If the same tokens match multiple dictionary terms (which should only happen for nested multi-word terms, such as "bad" and "not bad"), the longest term is always used. The "features" mode does not remove these duplicate matches.
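
To illustrate the difference, a minimal sketch with the nested terms mentioned above (the comments describe the documented behaviour):

dict = data.frame(string = c('bad', 'not bad'), code = c('negative', 'positive'))
tc = create_tcorpus('the food was not bad')
search_dictionary(tc, dict, mode = 'unique_hits')$hits  ## only the longer "not bad" is matched
search_dictionary(tc, dict, mode = 'features')$hits     ## the overlapping "bad" match is kept as well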

case_sensitive

logical, should lookup be case sensitive?

use_wildcards

Use the wildcards * (matches any number of characters, including none) and ? (matches a single character or none). If FALSE, exact string matching is used.
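
For example (a small sketch): "immigra*" matches "immigrants" and "immigration", but with use_wildcards = FALSE it is treated as the literal string "immigra*".

dict = data.frame(string = 'immigra*', code = 'immigration')
tc = create_tcorpus('immigrants discussed immigration')
search_dictionary(tc, dict)$hits
search_dictionary(tc, dict, use_wildcards = FALSE)$hits  ## no matches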

ascii

If TRUE, convert the text to ASCII before matching.

verbose

If TRUE, report progress.

Value

A vector with the id value (taken from dict$id) for each row in tc$tokens.

Examples

## dictionary with single- and multi-word lookup strings
dict = data.frame(string = c('this is', 'for a', 'not big enough'), code = c('a','c','b'))
tc = create_tcorpus(c('this is a test','This town is not big enough for a test'))

## the $hits element contains the matched tokens
search_dictionary(tc, dict)$hits
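
## a sketch of a follow-up step: count matches per dictionary code
## (assumes the $hits data.table has a 'code' column)
hits = search_dictionary(tc, dict)$hits
table(hits$code)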

[Package corpustools version 0.5.1]