cat_match {messy.cats} | R Documentation |
cat_match
Description
cat_match() matches each element of a messy vector to its closest
match in a clean vector. The closest match can be found using any of
several string distance measures.
Usage
cat_match(
messy_v,
clean_v,
return_dists = TRUE,
return_lists = NA,
pick_lists = FALSE,
threshold = NA,
method = "jw",
q = 1,
p = 0,
bt = 0,
useBytes = FALSE,
weight = c(d = 1, i = 1, t = 1)
)
Arguments
messy_v |
The messy string vector that will be restructured. This can come in the form of a column of a dataframe or a lone vector. |
clean_v |
The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a dataframe column or vector. |
return_dists |
If set to TRUE, the distance between the matched strings will be returned as a third column in the output dataframe. Default: TRUE |
return_lists |
Return list of top X matches, Default: NA |
pick_lists |
Set to TRUE to manually choose matches. Default: FALSE |
threshold |
The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA |
method |
The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw' |
q |
Size of the q-gram used in string distance calculation. Default: 1 |
p |
Only used with method "jw": the Jaro-Winkler penalty size. Default: 0 |
bt |
Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0 |
useBytes |
Whether or not to perform byte-wise comparison. Default: FALSE |
weight |
Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1) |
Details
When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.
By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-inputted data that often suffers from transposition or misspelling errors.
cat_match() is meant as an exploratory tool to discover how the elements
of two vectors will match using string distance measures. It also has added
functionality to solve issues by hand and to build a dataframe that can be
used to create custom matches between the clean and messy vectors.
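The closest-match idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: it uses utils::adist() (Levenshtein distance) in place of the default "jw" method, and the names dists, best, sketch, and max_dist are hypothetical. The max_dist cutoff mimics the role of the threshold argument.

```r
# Illustrative sketch only: base R closest-match lookup, not cat_match() itself.
messy_trees <- c("red oak", "williw", "hemluck", "pone", "dagwood", "mople")
clean_trees <- c("oak", "willow", "hemlock", "elm", "fir",
                 "birch", "pine", "dogwood", "maple")

# Distance matrix: one row per messy string, one column per clean string.
# adist() computes Levenshtein distance (an assumption; cat_match defaults to "jw").
dists <- utils::adist(messy_trees, clean_trees)

# For each messy string, find the column index of the smallest distance
# (ties resolve to the first minimum).
best <- apply(dists, 1, which.min)

sketch <- data.frame(
  messy = messy_trees,
  match = clean_trees[best],
  dist  = dists[cbind(seq_along(messy_trees), best)],
  stringsAsFactors = FALSE
)

# Hypothetical cutoff playing the role of `threshold`:
# matches farther away than max_dist are replaced with NA.
max_dist <- 2
sketch$match[sketch$dist > max_dist] <- NA

print(sketch)
```

In this sketch "red oak" is left unmatched because even its nearest clean string is more than two edits away, while the single-letter misspellings each resolve to the intended tree.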
Value
Returns a dataframe with each unique value in the messy vector and its closest match in the clean vector. If return_dists is TRUE, the distances between the matches are added as a column.
Examples
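if (interactive()) {
  library(messy.cats)
  # Misspelled or over-specified tree names to be cleaned
  messy_trees <- c("red oak", "williw", "hemluck", "white elm",
                   "fir tree", "birch tree", "pone", "dagwood", "mople")
  # Reference vector of correctly spelled categories
  clean_trees <- c("oak", "willow", "hemlock", "elm", "fir",
                   "birch", "pine", "dogwood", "maple")
  # With the defaults, returns each messy value, its closest clean match,
  # and the distance between them
  matched_trees <- cat_match(messy_trees, clean_trees)
}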