cat_join {messy.cats}

R Documentation

cat_join

Description

cat_join() joins two dataframes by finding the closest match between two specified columns whose values may differ through misspellings or slight formatting differences. The closest match can be found using any of several string distance measures.

Usage

cat_join(
  messy_df,
  clean_df,
  by,
  threshold = NA,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1),
  join = "left"
)

Arguments

`messy_df`	The dataframe to be joined using a messy categorical variable.
`clean_df`	The dataframe to be joined with a clean categorical variable to be used as a reference for the messy column.
`by`	A vector that specifies the columns to match and join by. If the columns have the same name in both dataframes, supply it as a single string: "column_name". If the columns have different names, supply a named vector: c("messy_column" = "clean_column").
`threshold`	The maximum distance that will form a match. If this argument is specified, any element in the messy vector that has no match closer than the threshold distance will be replaced with NA. Default: NA
`method`	The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'
`q`	Size of the q-gram used in string distance calculation. Default: 1
`p`	Only used with method "jw", the Jaro-Winkler penalty size. Default: 0
`bt`	Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0
`useBytes`	Whether or not to perform byte-wise comparison. Default: FALSE
`weight`	Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)
`join`	Choose a join function from the dplyr package to use in joining the datasets. Default: 'left'
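
The effect of `threshold` can be illustrated with a minimal sketch in base R. This is not the messy.cats implementation: base R's adist() (Levenshtein distance) stands in for the stringdist methods, and the vectors and the cutoff value of 2 are hypothetical. Note that the scale of a sensible threshold depends on the method; edit distances are integers, while "jw" distances lie between 0 and 1.

```r
# Illustrative sketch only; not the messy.cats implementation.
# adist() (base R, Levenshtein distance) stands in for the stringdist methods.
messy <- c("williw", "pone", "xyzzy")   # hypothetical messy values
clean <- c("willow", "pine", "maple")   # hypothetical reference values
threshold <- 2                          # hypothetical maximum distance for a match

# Distance matrix: one row per messy value, one column per clean value
d <- adist(messy, clean)

# Index and distance of the closest clean value for each messy value
best <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(messy), best)]

# Keep the closest match only when it is within the threshold; otherwise NA
matched <- ifelse(best_dist <= threshold, clean[best], NA_character_)
matched
```

Here "xyzzy" has no reference value within the cutoff, so it maps to NA, while the two misspellings map to their reference spellings.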

Details

When dealing with messy categorical string data, string distance matching can be an easy and efficient cleaning tool. A variety of string distance calculation algorithms have been developed for different types of data, and these algorithms can be used to detect and remedy problems with categorical string data.

By providing a correctly spelled and specified vector of categories to be compared against a vector of messy strings, a cleaned vector of categories can be generated by finding the correctly specified string most similar to each messy string. This method works particularly well for messy user-entered data, which often suffers from transposition or misspelling errors.

cat_join() joins the messy and clean datasets using the closest matching elements from the designated columns. The columns from the datasets are passed to cat_replace() as the messy and clean vectors, and the datasets are then joined using the dplyr join verb selected with the join argument.
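
The match-then-join idea described above can be sketched in base R. This is an assumption-laden illustration, not the package's code: adist() (Levenshtein distance) replaces the stringdist methods, merge() with all.x = TRUE replaces the dplyr left join, and the column name `ok` is made up for the example.

```r
# Illustrative sketch only; not how messy.cats is implemented internally.
messy_trees <- data.frame(
  V1 = c("williw", "hemluck", "pone", "dagwood", "mople"),
  V2 = c(12, 43, 12, 45, 35),
  stringsAsFactors = FALSE
)
clean_trees <- data.frame(
  V1 = c("willow", "hemlock", "pine", "dogwood", "maple"),
  ok = "y",                      # hypothetical column carried over by the join
  stringsAsFactors = FALSE
)

# Distance matrix: one row per messy value, one column per clean value
d <- adist(messy_trees$V1, clean_trees$V1)

# For each messy value, take the clean value at minimum distance
closest <- clean_trees$V1[apply(d, 1, which.min)]

# Replace the messy key with its closest clean match, then left-join on it
matched <- messy_trees
matched$V1 <- closest
joined <- merge(matched, clean_trees, by = "V1", all.x = TRUE, sort = FALSE)
joined
```

In this sketch every messy name is one edit away from its reference spelling, so each row of `messy_trees` picks up the matching row of `clean_trees`.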

Value

Returns a dataframe consisting of the two input dataframes joined on their designated columns.

Examples

if(interactive()){
 # EXAMPLE 1: join on a column named "V1" in both dataframes
 messy_trees <- data.frame(
   V1 = c("red oak", "williw", "hemluck", "white elm",
          "fir tree", "birch tree", "pone", "dagwood", "mople"),
   V2 = c(34, 12, 43, 32, 65, 23, 12, 45, 35),
   stringsAsFactors = FALSE
 )
 clean_trees <- data.frame(
   V1 = c("oak", "willow", "hemlock", "elm", "fir",
          "birch", "pine", "dogwood", "maple"),
   V2 = "y",
   stringsAsFactors = FALSE
 )
 # Match each messy name to its closest clean name using Jaccard distance
 cat_join(messy_trees, clean_trees, by = "V1", method = "jaccard")
}

[Package messy.cats version 1.0 Index]