spell_check {fossilbrush}    R Documentation

spell_check

Description

Function for checking for potential synonyms with alternate spellings. Synonyms are checked for within each group using a Jaro-Winkler string distance matrix. Potential synonyms are selected using the jw threshold. These can then be further filtered by the number of shared letters at the beginning and end of a synonym pair, and by prefixes or suffixes which may give erroneously high similarities.

Usage

spell_check(
  x,
  terms = NULL,
  groups = NULL,
  jw = 0.1,
  str = 1,
  str2 = NULL,
  alternative = "jaccard",
  q = 1,
  pref = NULL,
  suff = NULL,
  exclude = NULL,
  verbose = TRUE
)

Arguments

x

a dataframe containing a column with terms, and a further column denoting the groups within which terms will be checked against one another. If supplying a dataframe with just these columns, terms should be column 1

terms

a character vector of length 1, specifying the terms column in x. This is required if x contains more than two columns. Alternatively, if x is not provided, terms can be a character vector. If groups are not specified, all elements of terms will be treated as part of the same group

groups

a character vector of length 1, specifying the groups column in x. This is required if x contains more than two columns. Alternatively, if terms is supplied as a character vector, groups can also be supplied in the same way to denote their groups

jw

a numeric greater than 0 and less than 1. This is the distance threshold below which potential synonyms will be considered
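
As a rough illustration of how a distance threshold of this kind selects candidate pairs, here is a minimal sketch using the stringdist package. The genus names are hypothetical, and the Winkler prefix weight p = 0.1 is an assumption for illustration, not necessarily the value spell_check uses internally.

```r
library(stringdist)

# Hypothetical terms; the first two differ by a single letter
terms <- c("Atrypa", "Atripa", "Productus", "Spirifer")

# Pairwise Jaro-Winkler distances (p = 0.1 is an assumed prefix weight)
d <- stringdistmatrix(terms, terms, method = "jw", p = 0.1)
dimnames(d) <- list(terms, terms)

# Keep each unordered pair once, and only those below the threshold
jw <- 0.1
hits <- which(d < jw & upper.tri(d), arr.ind = TRUE)
candidates <- data.frame(
  t1 = terms[hits[, "row"]],
  t2 = terms[hits[, "col"]],
  dist = d[hits]
)
candidates
```

Here only the "Atrypa"/"Atripa" pair falls below the threshold. Raising jw admits more distant pairs and so more false positives.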

str

A positive integer specifying the number of matching characters at the beginning of synonym pairs. By default 1, i.e. the first letters must match

str2

If not NULL, a positive integer specifying the number of matching characters at the end of synonym pairs
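
The effect of str and str2 can be pictured with a small base R sketch. The helper share_ends is hypothetical and only mirrors the behaviour described above; it is not the package's own code.

```r
# Hypothetical helper: TRUE when two terms share their first `str`
# characters and, if `str2` is given, their last `str2` characters
share_ends <- function(a, b, str = 1, str2 = NULL) {
  ok <- substr(a, 1, str) == substr(b, 1, str)
  if (!is.null(str2)) {
    tail_a <- substr(a, nchar(a) - str2 + 1, nchar(a))
    tail_b <- substr(b, nchar(b) - str2 + 1, nchar(b))
    ok <- ok & tail_a == tail_b
  }
  ok
}

# First letters match, so the pair passes the default filter
r1 <- share_ends("Atrypa", "Atripa")
# Last three characters differ ("ypa" vs "ina"), so the pair is rejected
r2 <- share_ends("Atrypa", "Atrypina", str2 = 3)
```

Under these assumptions, a larger str or str2 makes the filter stricter, since more characters must agree at each end.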

alternative

A character string of length one naming one of the string distance methods accepted by afind: "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine", "jaccard", or "soundex".

q

q-gram size. Only used when alternative is "qgram", "cosine" or "jaccard".
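
To show how q changes a q-gram based measure, the sketch below computes the Jaccard distance between two hypothetical terms with the stringdist package. It illustrates the metric only and makes no claim about how spell_check applies it.

```r
library(stringdist)

# Jaccard distance on sets of q-grams: 1 - |intersection| / |union|
# q = 1 compares sets of single characters (case-sensitive)
d1 <- stringdist("Atrypa", "Atripa", method = "jaccard", q = 1)

# q = 2 compares sets of adjacent character pairs, which is stricter
d2 <- stringdist("Atrypa", "Atripa", method = "jaccard", q = 2)

c(q1 = d1, q2 = d2)
```

With q = 1 the two terms share 5 of 7 distinct characters, giving a distance of 2/7. With q = 2 they share 3 of 7 distinct bigrams, giving 4/7, so the same misspelling looks more distant at larger q.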

pref

If not NULL, a character vector of prefixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same prefix

suff

If not NULL, a character vector of suffixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same suffix
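
One way to read the pref and suff filters is sketched below in base R for the suffix case. The helpers which_suffix and same_suffix are hypothetical. Treating two terms that carry none of the listed suffixes as compatible is an assumption, as the description above does not spell out that case.

```r
# Hypothetical helper: the first listed suffix a term ends with, or NA
which_suffix <- function(term, suff) {
  hit <- suff[endsWith(term, suff)]
  if (length(hit) > 0) hit[1] else NA_character_
}

# Hypothetical helper: TRUE when both terms carry the same listed suffix
# (or, by assumption, when neither carries any of them)
same_suffix <- function(a, b, suff) {
  identical(which_suffix(a, suff), which_suffix(b, suff))
}

b_suff <- c("ina", "ella", "etta")

# Only one term ends in "ina", so the pair would be dropped
s1 <- same_suffix("Terebratula", "Terebratulina", b_suff)
# Both end in "ina", so the pair would be kept
s2 <- same_suffix("Terebratulina", "Terebratalina", b_suff)
```

This captures why such a filter helps: a long shared stem can give a low JW distance to names that are distinct taxa differing only by a diminutive ending.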

exclude

If not NULL, a character vector of group names which should be skipped - useful for groups which are known to contain potentially similar terms

verbose

A logical determining whether function progress should be reported using the pbapply progress bar

Value

a dataframe of synonyms (cols 1 and 2), the group in which they occur, the frequencies of each synonym in the dataset and finally the q-gram difference between the synonyms

Examples

# load dataset
data("brachios")
# define suffixes
b_suff <- c("ina", "ella", "etta")
# run function
spl <- spell_check(brachios, terms = "genus", groups = "family", suff = b_suff)

[Package fossilbrush version 1.0.3 Index]