cat_match {messy.cats}    R Documentation

cat_match

Description

cat_match() matches the contents of a messy vector with the closest match in a clean vector. The closest match can be found using a variety of different string distance measurement options.

Usage

cat_match(
  messy_v,
  clean_v,
  return_dists = TRUE,
  return_lists = NA,
  pick_lists = FALSE,
  threshold = NA,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1)
)

Arguments

messy_v

The messy string vector that will be restructured. This can come in the form of a column of a dataframe or a lone vector.

clean_v

The clean string vector that will be referenced to perform the restructuring. Again, this argument can be a dataframe column or a vector.

return_dists

If set to TRUE, the distance between the matched strings will be returned as a third column in the output dataframe. Default: TRUE

return_lists

Return a list of the top X matches. Default: NA

pick_lists

Set to TRUE to manually choose matches. Default: FALSE

threshold

The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA

method

The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'

q

Size of the q-gram used in string distance calculation. Default: 1

p

Only used with method "jw", the Jaro-Winkler penalty size. Default: 0

bt

Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0

useBytes

Whether or not to perform byte-wise comparison. Default: FALSE

weight

Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
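
The threshold behaviour described above can be pictured with a small base R sketch. This is not the package's implementation: match_within() is a hypothetical helper, and it uses utils::adist() (generalized Levenshtein distance) in place of the stringdist methods listed under method, so the distances differ from what cat_match() would report. It only illustrates the idea that a messy string with no clean string within the maximum distance becomes NA.

```r
# Hypothetical helper, not part of messy.cats: return the nearest clean
# string by base R Levenshtein distance, or NA when nothing is within
# max_dist of the messy string.
match_within <- function(messy, clean, max_dist) {
  # Rows correspond to messy strings, columns to clean strings.
  dists <- utils::adist(messy, clean)
  # Column index of the closest clean string for each messy string.
  best <- apply(dists, 1, which.min)
  best_dist <- dists[cbind(seq_along(messy), best)]
  # Keep the match only if it is within the allowed distance.
  ifelse(best_dist <= max_dist, clean[best], NA_character_)
}

res <- match_within(c("williw", "red oak", "zzzzzz"),
                    c("oak", "willow", "pine"),
                    max_dist = 2)
```

Under these assumptions, "williw" is one edit away from "willow" and is kept, while "red oak" and "zzzzzz" have no clean string within two edits and come back as NA.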

Details

When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.

By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-inputted data that often suffers from transposition or misspelling errors.
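
The closest-match idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not how cat_match() is implemented: it uses utils::adist() (generalized Levenshtein distance) instead of the default "jw" method, and the object names dists, best, and sketch are made up for the example.

```r
# Illustrative sketch only: base R's adist() computes generalized
# Levenshtein distance, not the Jaro-Winkler default of cat_match().
messy_trees <- c("williw", "hemluck", "pone", "dagwood", "mople")
clean_trees <- c("oak", "willow", "hemlock", "elm", "fir", "birch",
                 "pine", "dogwood", "maple")

# Distance matrix: one row per messy string, one column per clean string.
dists <- utils::adist(messy_trees, clean_trees)

# For each messy string, pick the clean string with the smallest distance.
best <- apply(dists, 1, which.min)

sketch <- data.frame(
  messy = messy_trees,
  match = clean_trees[best],
  dist  = dists[cbind(seq_along(messy_trees), best)],
  stringsAsFactors = FALSE
)
```

Each of these misspellings is a single substitution away from its intended category, so every row of sketch pairs the messy string with the correct clean string at distance 1. Other distance measures can rank candidates differently, which is why the choice of method matters for less tidy input.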

cat_match() is meant as an exploratory tool to discover how the elements of two vectors will match using string distance measures, and has added functionality to solve issues by hand and create a dataframe that can be used to create custom matches between the clean and messy vectors.

Value

Returns a dataframe with each unique value in the messy vector and its closest match in the clean vector. If return_dists is TRUE, the distances between the matches are added as a column.

Examples

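if (interactive()) {
  # Messy, user-entered category labels.
  messy_trees <- c("red oak", "williw", "hemluck", "white elm",
                   "fir tree", "birch tree", "pone", "dagwood", "mople")
  # Correctly spelled reference categories.
  clean_trees <- c("oak", "willow", "hemlock", "elm", "fir",
                   "birch", "pine", "dogwood", "maple")
  # Match each messy label to its closest clean category.
  matched_trees <- cat_match(messy_trees, clean_trees)
}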

[Package messy.cats version 1.0 Index]