spell_check {fossilbrush} | R Documentation |
spell_check
Description
Function for checking for potential synonyms with alternate spellings. Synonyms are checked for within each group using a Jaro-Winkler string distance matrix. Potential synonyms are selected using the jw threshold. These can then be further filtered by the number of shared letters at the beginning and end of a synonym pair, and by prefixes or suffixes which may give erroneously high similarities.
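The thresholding idea described above can be pictured with a minimal base R sketch. This is not the package's implementation: `jw_distance` and `flag_pairs` are hypothetical helpers written for illustration, and the Winkler prefix weight `p = 0.1` is an assumption, not a documented setting of spell_check.

```r
# Illustrative only: hypothetical helpers, not part of fossilbrush.

# Jaro-Winkler distance between two single strings (0 = identical).
# p is the Winkler prefix weight; 0.1 is an assumed, conventional value.
jw_distance <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  l1 <- length(s1)
  l2 <- length(s2)
  if (l1 == 0 || l2 == 0) return(as.numeric(l1 != l2))
  # Characters only count as matches within this window
  w <- max(floor(max(l1, l2) / 2) - 1, 0)
  m1 <- rep(FALSE, l1)
  m2 <- rep(FALSE, l2)
  for (i in seq_len(l1)) {
    lo <- max(1, i - w)
    hi <- min(l2, i + w)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!m2[j] && s1[i] == s2[j]) {
        m1[i] <- TRUE
        m2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(m1)
  if (m == 0) return(1)
  # Half the number of matched characters that appear in a different order
  t <- sum(s1[m1] != s2[m2]) / 2
  jaro <- (m / l1 + m / l2 + (m - t) / m) / 3
  # Length of the common prefix, capped at four characters
  pre <- 0
  for (k in seq_len(min(4, l1, l2))) {
    if (s1[k] == s2[k]) pre <- pre + 1 else break
  }
  1 - (jaro + pre * p * (1 - jaro))
}

# Flag pairs within one group whose distance falls below the jw threshold
# and whose first `str` characters agree.
flag_pairs <- function(terms, jw = 0.1, str = 1) {
  terms <- unique(terms)
  if (length(terms) < 2) {
    return(data.frame(t1 = character(0), t2 = character(0), dist = numeric(0)))
  }
  idx <- utils::combn(length(terms), 2)
  a <- terms[idx[1, ]]
  b <- terms[idx[2, ]]
  d <- mapply(jw_distance, a, b, USE.NAMES = FALSE)
  keep <- d < jw & substr(a, 1, str) == substr(b, 1, str)
  data.frame(t1 = a[keep], t2 = b[keep], dist = d[keep],
             stringsAsFactors = FALSE)
}

flag_pairs(c("Atrypa", "Atrypia", "Spirifer"))
```

In this sketch only the "Atrypa"/"Atrypia" pair falls below the default threshold of 0.1; "Spirifer" is too distant from either term to be flagged.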
Usage
spell_check(
x,
terms = NULL,
groups = NULL,
jw = 0.1,
str = 1,
str2 = NULL,
alternative = "jaccard",
q = 1,
pref = NULL,
suff = NULL,
exclude = NULL,
verbose = TRUE
)
Arguments
x |
a dataframe containing a column with terms, and a further column denoting the groups within which terms will be checked against one another. If supplying a dataframe with just these columns, terms should be column 1 |
terms |
a character vector of length 1, specifying the terms column in x. This is required if x contains more than two columns. Alternatively, if x is not provided, terms can be a character vector. If groups are not specified, all elements of terms will be treated as part of the same group |
groups |
a character vector of length 1, specifying the groups column in x. This is required if x contains more than two columns. Alternatively, if terms is supplied as a character vector, groups can also be supplied in the same way to denote their groups |
jw |
a numeric greater than 0 and less than 1. This is the distance threshold below which potential synonyms will be considered |
str |
A positive integer specifying the number of matching characters at the beginning of synonym pairs. By default 1, i.e. the first letters must match |
str2 |
If not NULL, a positive integer specifying the number of matching characters at the end of synonym pairs |
alternative |
A character string of length one corresponding to one of the methods used by afind. One of "osa", "lv", "dl", "hamming", "lcs", "qgram", "cosine", "running_cosine", "jaccard", or "soundex". |
q |
q-gram size. Only used when alternative is "qgram", "cosine" or "jaccard". |
pref |
If not NULL, a character vector of prefixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same prefix |
suff |
If not NULL, a character vector of suffixes which may result in erroneously low JW distances. Synonyms will only be considered if both terms share the same suffix |
exclude |
If not NULL, a character vector of group names which should be skipped - useful for groups which are known to contain potentially similar terms |
verbose |
A logical determining whether function progress should be reported using the pbapply progress bar |
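The str, str2 and suff filters can be pictured with a short base R sketch. The helper names below are hypothetical and not part of fossilbrush. The suffix rule shown (a pair passes if both terms carry the same listed suffix, or if neither carries one) is an assumed reading of the suff description, not confirmed package behaviour.

```r
# Illustrative only: hypothetical helpers, not part of fossilbrush.

# Do two terms share their first n characters (the idea behind str)?
same_start <- function(a, b, n) {
  substr(a, 1, n) == substr(b, 1, n)
}

# Do two terms share their last n characters (the idea behind str2)?
same_end <- function(a, b, n) {
  substring(a, nchar(a) - n + 1) == substring(b, nchar(b) - n + 1)
}

# Longest listed suffix that a term ends with, or NA if none applies.
matched_suffix <- function(term, suff) {
  hit <- suff[endsWith(term, suff)]
  if (length(hit) == 0) NA_character_ else hit[which.max(nchar(hit))]
}

# Assumed rule: a pair passes only if both terms carry the same listed
# suffix, or neither term carries any listed suffix.
suffix_compatible <- function(a, b, suff) {
  identical(matched_suffix(a, suff), matched_suffix(b, suff))
}

b_suff <- c("ina", "ella", "etta")
same_start("Atrypa", "Atrypia", 1)
same_end("Atrypa", "Atrypia", 2)
suffix_compatible("Spiriferina", "Spiriferella", b_suff)
suffix_compatible("Atrypina", "Atripina", b_suff)
```

Under these assumptions, "Spiriferina" and "Spiriferella" are kept apart because they end in different listed suffixes, even though the rest of the two strings is identical and their string distance is therefore low.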
Value
a dataframe of synonyms (cols 1 and 2), the group in which they occur, the frequencies of each synonym in the dataset and finally the q-gram difference between the synonyms
Examples
# load dataset
data("brachios")
# define suffixes
b_suff <- c("ina", "ella", "etta")
# run function
spl <- spell_check(brachios, terms = "genus", groups = "family", suff = b_suff)
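A stricter variant of the call above might look like the following sketch. It uses only arguments documented on this page; the specific values (jw = 0.05, str = 2, str2 = 1) are arbitrary choices for illustration, and the `requireNamespace` guard and the name `spl_strict` are additions so the snippet can be run whether or not the package is installed.

```r
# Sketch only: the threshold values here are illustrative, not recommendations.
spl_strict <- NULL
if (requireNamespace("fossilbrush", quietly = TRUE)) {
  library(fossilbrush)
  # load dataset
  data("brachios")
  # define suffixes
  b_suff <- c("ina", "ella", "etta")
  # tighter distance threshold, two matching leading characters,
  # one matching trailing character, and no progress bar
  spl_strict <- spell_check(brachios, terms = "genus", groups = "family",
                            jw = 0.05, str = 2, str2 = 1, suff = b_suff,
                            verbose = FALSE)
  # inspect the first few flagged pairs
  head(spl_strict)
}
```

Lowering jw and raising str should return fewer candidate pairs than the default call, which can be useful when the default output contains many distinct but similarly spelled names.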