cat_replace {messy.cats}    R Documentation

cat_replace

Description

cat_replace() replaces each element of a messy vector with its closest match in a clean vector. The closest match can be found using any of several string distance measures.

Usage

cat_replace(
  messy_v,
  clean_v,
  threshold = NA,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1)
)

Arguments

messy_v

The messy string vector that will be restructured. This can be a column of a data frame or a standalone vector.

clean_v

The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a data frame column or a vector.

threshold

The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA

method

The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'

q

Size of the q-gram used in string distance calculation. Default: 1

p

Only used with method "jw": the Jaro-Winkler penalty size. Default: 0

bt

Only used with method "jw" when p > 0: Winkler's boost threshold. Default: 0

useBytes

Whether or not to perform byte-wise comparison. Default: FALSE

weight

Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
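
As a rough illustration of how the method, q, and p arguments change the distance between the same pair of strings, the sketch below calls stringdist::stringdist() directly. It assumes the stringdist package is installed and that cat_replace() passes these arguments through to it, which this page suggests but does not state. The variable names are for illustration only.

```r
# Sketch only: calls stringdist directly and skips quietly if it is not installed.
if (requireNamespace("stringdist", quietly = TRUE)) {
  a <- "dagwood"
  b <- "dogwood"

  # Same pair of strings, four different distance settings.
  d_jw  <- stringdist::stringdist(a, b, method = "jw")            # p = 0: plain Jaro distance
  d_jwp <- stringdist::stringdist(a, b, method = "jw", p = 0.1)   # Jaro-Winkler with p = 0.1
  d_lv  <- stringdist::stringdist(a, b, method = "lv")            # Levenshtein edit distance
  d_qg  <- stringdist::stringdist(a, b, method = "qgram", q = 2)  # distance between 2-gram counts

  print(c(jw = d_jw, jw_p = d_jwp, lv = d_lv, qgram = d_qg))
}
```

The "jw" distances fall between 0 and 1, while "lv" and "qgram" return counts, so a threshold has to be chosen on the scale of the method in use.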

Details

When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.

Given a correctly spelled and specified vector of categories and a vector of messy strings, a cleaned vector of categories can be generated by finding, for each messy string, the most similar correctly specified string. This method works particularly well for messy user-entered data, which often suffers from transposition or misspelling errors.

cat_replace() replaces the elements of the messy vector with the closest matching element from the clean vector.
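
The matching step described above can be sketched in base R. This is a minimal illustration, not the implementation of cat_replace(): nearest_match is a hypothetical helper, it uses Levenshtein distance from utils::adist() rather than the default "jw" method, and it assumes that a distance equal to the threshold still counts as a match.

```r
# Hypothetical helper, not part of messy.cats: replace each messy string with
# its nearest clean string by Levenshtein distance (utils::adist).
# Assumes neither vector contains NA.
nearest_match <- function(messy_v, clean_v, threshold = NA) {
  # Distance matrix: rows index messy strings, columns index clean strings.
  d <- utils::adist(messy_v, clean_v)

  # For each messy string, the column of the closest clean string
  # (ties go to the first clean string) and the distance to it.
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(messy_v), best)]

  out <- clean_v[best]

  # Assumed threshold rule: anything farther than the threshold becomes NA.
  if (!is.na(threshold)) {
    out[best_dist > threshold] <- NA_character_
  }
  out
}

nearest_match(c("williw", "hemluck", "mople"), c("willow", "hemlock", "maple"))
```

Each of the three messy strings in the call is one edit away from exactly one clean string, so the result is "willow", "hemlock", "maple". With threshold = 1, a string such as "zzzzzz" would come back as NA because no clean string is within one edit of it.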

Value

cat_replace() returns a cleaned version of the messy vector, with each element replaced by the most similar element of the clean vector.

Examples

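if (interactive()) {
  library(messy.cats)

  # Misspelled or over-specified tree names to be cleaned
  messy_trees <- c("red oak", "williw", "hemluck", "white elm", "fir tree",
                   "birch tree", "pone", "dagwood", "mople")
  # Reference vector of correctly spelled categories
  clean_trees <- c("oak", "willow", "hemlock", "elm", "fir", "birch", "pine",
                   "dogwood", "maple")
  # Replace each messy name with its closest clean match (default method "jw")
  cleaned_trees <- cat_replace(messy_trees, clean_trees)
}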

[Package messy.cats version 1.0 Index]