cat_replace {messy.cats} | R Documentation |
cat_replace
Description
cat_replace()
replaces the contents of a messy vector with
the closest match in a clean vector. The closest match can be found
using a variety of different string distance measurement options.
Usage
cat_replace(
messy_v,
clean_v,
threshold = NA,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1)
)
Arguments
messy_v |
The messy string vector that will be restructured. This can come in the form of a column of a dataframe or a lone vector. |
clean_v |
The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a dataframe column or vector. |
threshold |
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA |
method |
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw' |
q |
Size of the q-gram used in string distance calculation. Default: 1 |
p |
Only used with method "jw", the Jaro-Winkler penalty size. Default: 0 |
bt |
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0 |
useBytes |
Whether or not to perform byte-wise comparison. Default: FALSE |
weight |
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1) |
Details
When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.
By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-entered data, which often suffers from transposition or misspelling errors.
cat_replace()
replaces the elements of the messy vector with the closest matching
element from the clean vector.
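The closest-match idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `nearest_match()` is a hypothetical helper, it uses `utils::adist()` (generalized Levenshtein distance) instead of the default Jaro-Winkler method, and it assumes that a match whose distance is strictly greater than the threshold becomes NA.

```r
# Hypothetical helper, NOT part of messy.cats: nearest-match replacement
# built on base R's adist() (generalized Levenshtein distance).
nearest_match <- function(messy_v, clean_v, threshold = NA) {
  # Distance matrix: one row per messy string, one column per clean string.
  d <- utils::adist(messy_v, clean_v)

  # Column index of the smallest distance in each row (first one on ties).
  best <- max.col(-d, ties.method = "first")
  best_d <- d[cbind(seq_along(messy_v), best)]

  out <- clean_v[best]
  # Assumption: anything farther away than the threshold is set to NA.
  if (!is.na(threshold)) {
    out[best_d > threshold] <- NA
  }
  out
}

messy <- c("williw", "hemluck", "mople", "zzzzzz")
clean <- c("willow", "hemlock", "maple")

nearest_match(messy, clean)
# "willow" "hemlock" "maple" "willow"

nearest_match(messy, clean, threshold = 2)
# "willow" "hemlock" "maple" NA
```

Without a threshold, every messy string is mapped to some clean string, even one as unrelated as "zzzzzz"; supplying a threshold turns such poor matches into NA. Levenshtein distances are edit counts, whereas Jaro-Winkler distances lie between 0 and 1, so a sensible threshold depends on the method chosen.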
Value
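cat_replace() returns a cleaned version of the messy vector, with each element replaced by the most similar element of the clean vector. If a threshold is specified, elements with no match within the threshold distance are returned as NA.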
Examples
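if (interactive()) {
  # Misspelled or over-specified category labels
  messy_trees = c("red oak", "williw", "hemluck", "white elm", "fir tree",
                  "birch tree", "pone", "dagwood", "mople")
  # Reference vector of correctly spelled categories
  clean_trees = c("oak", "willow", "hemlock", "elm", "fir", "birch",
                  "pine", "dogwood", "maple")
  # Replace each messy label with its closest clean match
  cleaned_trees = cat_replace(messy_trees, clean_trees)
}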