fuzzy_rbind {messy.cats}    R Documentation

fuzzy_rbind

Description

fuzzy_rbind() row-binds two dataframes, matching columns whose names differ slightly.

Usage

fuzzy_rbind(
  df1,
  df2,
  threshold,
  method = "jw",
  q = 1,
  p = 0,
  bt = 0,
  useBytes = FALSE,
  weight = c(d = 1, i = 1, t = 1)
)

Arguments

df1

The first dataframe to be bound.

df2

The second dataframe to be bound.

threshold

The maximum string distance allowed between column names. If the distance between two column names is greater than this threshold, the columns will not be bound.

method

The type of string distance calculation to use. Possible methods are: osa, lv, dl, hamming, lcs, qgram, cosine, jaccard, jw, and soundex. See package stringdist for more information. Default: 'jw'

q

Size of the q-gram used in string distance calculation. Default: 1

p

Only used with method "jw": the Jaro-Winkler penalty size. Default: 0

bt

Only used with method "jw" with p > 0, Winkler's boost threshold. Default: 0

useBytes

Whether or not to perform byte-wise comparison. Default: FALSE

weight

Only used with methods "osa" or "dl", a vector representing the penalty for deletion, insertion, substitution, and transposition, in that order. Default: c(d = 1, i = 1, t = 1)

Details

Column names often differ slightly between datasets. fuzzy_rbind() binds dataframes by fuzzy matching their column names.
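
The matching idea can be sketched as follows. This is a minimal illustration, not the package's implementation. The helper match_columns() is hypothetical, and base R's utils::adist() (generalized Levenshtein distance) stands in for stringdist so that the sketch runs without extra packages. As a result, the threshold here is a count of edits, not a value on the 0 to 1 scale used by method "jw".

```r
# Hypothetical helper (not part of messy.cats): for each name in names2,
# return the closest name in names1, or NA if no name is within the threshold.
match_columns <- function(names1, names2, threshold) {
  # Distance matrix: rows correspond to names2, columns to names1.
  # adist() computes Levenshtein edit distance (an assumption made for this
  # sketch; fuzzy_rbind() itself defaults to method "jw").
  d <- utils::adist(names2, names1)

  # Index of the nearest names1 entry for every names2 entry
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(names2), best)]

  # Keep a match only when its distance does not exceed the threshold
  ifelse(best_dist <= threshold, names1[best], NA_character_)
}

matched <- match_columns(
  names1 = c("mpg", "cyl", "disp"),
  names2 = c("mpg_17", "cyl_17", "disp_2017"),
  threshold = 3
)
print(matched)
```

With a threshold of 3 edits, mpg_17 and cyl_17 are paired with mpg and cyl, while disp_2017 is 5 edits away from disp and is left unmatched. This mirrors how a stricter threshold leaves columns unbound.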

Value

fuzzy_rbind() returns a dataframe that binds the two input dataframes based on their closest matching columns. Column names from df1 are preserved.

Examples

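if (interactive()) {
  # Copy mtcars and add suffixes so the column names no longer match exactly
  mtcars_colnames_messy <- mtcars
  colnames(mtcars_colnames_messy)[1:5] <- paste0(colnames(mtcars)[1:5], "_17")
  colnames(mtcars_colnames_messy)[6:11] <- paste0(colnames(mtcars)[6:11], "_2017")

  # A looser threshold accepts more distant column-name matches
  x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .5)
  # A stricter threshold requires closer column-name matches
  x <- fuzzy_rbind(mtcars, mtcars_colnames_messy, .2)
}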

[Package messy.cats version 1.0 Index]