fix_typos {messy.cats}    R Documentation

fix_typos

Description

This function fixes typos in strings, such as proper nouns, that are not normally found in dictionaries.

Usage

fix_typos(typo_v, thr, occ_ratio)

Arguments

typo_v

vector of strings whose typos will be cleaned

thr

the maximum string distance used to identify typos. This argument is specified as the percentage of a typo that is, at most, expected to consist of insertions, additions, deletions, and transpositions.

occ_ratio

the minimum ratio of occurrences of a correctly spelled word to occurrences of its typo. This argument helps to weed out words that are similar but valid. For example, commonly occurring valid names such as Adam and Amy will not be treated as typos of one another, even though they are similar, because both appear often. A string is recognized as a typo only if it is both similar to a common word and infrequent itself.

Details

Tools like the hunspell package fix typos in words found in dictionaries, but they struggle with strings such as proper nouns and other specialized terminology that common dictionaries usually omit. This function instead uses the text being cleaned as its own dictionary. It treats frequently occurring strings as probable correct spellings and rarely occurring strings as candidate typos. The underlying assumption is that typos appear infrequently, because no one uses them on purpose, and that each typo lies a short string distance from a commonly occurring, correctly spelled word.
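
The idea described above can be illustrated with a minimal sketch in base R. This is not the messy.cats implementation: the helper sketch_fix_typos and its arguments max_rel_dist and min_ratio are hypothetical names chosen for illustration, and utils::adist (Levenshtein distance, which does not count a transposition as a single edit) stands in for whatever distance fix_typos actually uses.

```r
# Hypothetical helper for illustration only; not the messy.cats source.
# max_rel_dist: assumed cap on edit distance relative to the typo's length.
# min_ratio:    assumed minimum ratio of correct-word count to typo count.
sketch_fix_typos <- function(x, max_rel_dist = 0.25, min_ratio = 5) {
  counts <- table(x)
  words  <- names(counts)
  n      <- as.integer(counts)

  # Pairwise Levenshtein distances between the distinct strings
  d <- utils::adist(words)

  out <- x
  for (i in seq_along(words)) {
    # Candidate corrections: other strings that occur at least
    # min_ratio times as often as the string under inspection
    cand <- which(n >= min_ratio * n[i] & seq_along(words) != i)
    if (length(cand) == 0) next

    # Edit distance as a share of the suspected typo's length
    rel  <- d[i, cand] / nchar(words[i])
    best <- cand[which.min(rel)]

    # Replace only when the closest frequent string is close enough
    if (min(rel) <= max_rel_dist) {
      out[which(x == words[i])] <- words[best]
    }
  }
  out
}

# "Adam" and "Amy" both occur often, so neither is treated as a typo;
# the single "Charlotesville" is rare and one edit away from a common string.
x <- c(rep("Adam", 6), rep("Amy", 5),
       rep("Charlottesville", 8), "Charlotesville")
cleaned <- sketch_fix_typos(x)
```

In this sketch the frequency test runs before the distance test, so two valid but similar strings that both occur often are never compared as typo and correction.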

Value

reformatted vector with typos replaced with correctly spelled words

Examples

if(interactive()){
 #EXAMPLE1
 }

[Package messy.cats version 1.0 Index]