| match_df {matchmaker} | R Documentation |
Check and clean spelling or codes of multiple variables in a data frame
Description
This function cleans your data according to pre-defined rules encapsulated in either a data frame or a list of data frames. It is useful for correcting misspellings and recoding variables (e.g. from electronic survey data).
Usage
match_df(
x = data.frame(),
dictionary = list(),
from = 1,
to = 2,
by = 3,
order = NULL,
warn = FALSE
)
Arguments
x |
a data frame |
dictionary |
a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the dictionary by column in x (see by). |
from |
a column name or position defining words or keys to be replaced |
to |
a column name or position defining replacement values |
by |
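character or integer. If dictionary is a data frame, the name or position of the column in dictionary whose values identify the columns of x that each entry applies to. Defaults to 3, the third column. |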
order |
a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the resulting factors will be sorted. |
warn |
if TRUE, warnings and errors from match_vec() are shown. Defaults to FALSE, which keeps them quiet. |
Details
By default, this applies the function match_vec() to all
columns specified by the column names listed in by, or, if a
global dictionary is used, this includes all character and factor
columns as well.
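As a rough illustration of how a single data-frame dictionary relates to the named-list form, the base R sketch below splits a small, made-up dictionary on its third column. The column names (options, values, grp) and the data are assumptions for illustration, and this is not necessarily how match_df() does the split internally.

```r
# Hypothetical dictionary: the third column (grp) names the column of x
# that each row applies to.
dict <- data.frame(
  options = c("y", "n", "hosp", "clin"),
  values  = c("Yes", "No", "Hospital", "Clinic"),
  grp     = c("treated", "treated", "facility", "facility"),
  stringsAsFactors = FALSE
)

# One data frame per target column, named after that column
by_column <- split(dict[c("options", "values")], dict$grp)

names(by_column)
by_column$treated
```

Each element of by_column holds only the key/replacement pairs for one column, which is the kind of per-column word list that match_vec() works with.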
by column
Spelling variables within dictionary represent keys that you want to match
to column names in x (the data set). These are expected to match exactly
with the exception of two reserved keywords that start with a full stop:
- .regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
- .global: any column (see Section Global dictionary)
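The key rules above can be sketched in base R. The helper resolve_key() below is hypothetical (it is not part of matchmaker) and only mimics the described behaviour: exact matching by default, PERL-flavored matching for .regex keys, and every column for .global. It ignores the restriction to character and factor columns.

```r
# Hypothetical helper, for illustration only: map one key from the `by`
# column of a dictionary to the column names of x it would select.
resolve_key <- function(key, columns) {
  if (key == ".global") {
    # .global selects every column (type checks are left out of this sketch)
    return(columns)
  }
  if (startsWith(key, ".regex ")) {
    # Drop the keyword; the rest is an unquoted PERL-flavored pattern
    pattern <- sub("^\\.regex[[:space:]]+", "", key)
    return(grep(pattern, columns, perl = TRUE, value = TRUE))
  }
  # Any other key must match a column name exactly
  columns[columns == key]
}

columns <- c("id", "lab_result_01", "lab_result_02", "age_group")

regex_cols  <- resolve_key(".regex ^lab_result_", columns)
exact_cols  <- resolve_key("age_group", columns)
global_cols <- resolve_key(".global", columns)
```

Here regex_cols contains both lab_result columns, exact_cols contains only age_group, and global_cols contains all four names.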
Global dictionary
A global dictionary is a set of definitions applied to all valid columns of
x indiscriminately.
- .global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example: if you have a column for cardinal directions and a definition for N = North, then the global variable N = no will not override that. See Example.
- by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminately, then setting by = NULL will use that dictionary globally.
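The cardinal-direction case from the note above can be tried on a toy data set. The data, dictionary and column names below are made up for illustration. Only match_df() and its documented arguments come from this page.

```r
library(matchmaker)

# Hypothetical data: "N" means North in `direction` but "no" in `treated`
dat <- data.frame(
  direction = c("N", "S", "N"),
  treated   = c("Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Column-specific definitions for `direction`, plus a .global group
dict <- data.frame(
  options = c("N", "S", "Y", "N"),
  values  = c("North", "South", "yes", "no"),
  grp     = c("direction", "direction", ".global", ".global"),
  stringsAsFactors = FALSE
)

res <- match_df(dat,
                dictionary = dict,
                from = "options",
                to = "values",
                by = "grp")
res
```

Following the override rule, the N values in direction should be recoded with the column-specific definition (North), while treated, which has no definitions of its own, should pick up the global ones.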
Value
a data frame with re-defined data based on the dictionary
Author(s)
Zhian N. Kamvar
Patrick Barks
See Also
match_vec(), which this function wraps.
Examples
# Read in dictionary and coded data examples --------------------
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)
# Clean spelling based on dictionary -----------------------------
dict # show the dict
head(dat) # show the data
res1 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp")
head(res1)
# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:
res1 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp",
warn = TRUE)
head(res1)
# You can ensure the order of the factor levels is correct by specifying
# a column that defines order.
dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp",
order = "orders")
head(res2)
as.list(head(res2))