match_df {matchmaker} | R Documentation |
Check and clean spelling or codes of multiple variables in a data frame
Description
This function allows you to clean your data according to pre-defined rules encapsulated in either a data frame or list of data frames. It has application for addressing mis-spellings and recoding variables (e.g. from electronic survey data).
Usage
match_df(
x = data.frame(),
dictionary = list(),
from = 1,
to = 2,
by = 3,
order = NULL,
warn = FALSE
)
Arguments
x |
a data frame |
dictionary |
a data frame or named list of data frames with at least two
columns defining the word list to be used. If this is a data frame, a third
column must be present to split the dictionary by column in x (see by) |
from |
a column name or position defining words or keys to be replaced |
to |
a column name or position defining replacement values |
by |
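character or integer. If dictionary is a data frame, the name or position of the column whose keys identify which columns of x each definition applies to. If NULL, the dictionary is applied globally (see Section Global dictionary) |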
order |
a character string naming the column to be used for sorting the values in each data frame. If the incoming variables are factors, this determines how the levels of the resulting factors will be ordered. |
warn |
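if TRUE, warnings and errors from match_vec() are printed to the console. Defaults to FALSE, which keeps them quiet. |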
Details
By default, this applies the function match_vec() to all columns specified by the column names listed in by, or, if a global dictionary is used, this includes all character and factor columns as well.
by column
Spelling variables within dictionary represent keys that you want to match to column names in x (the data set). These are expected to match exactly, with the exception of two reserved keywords that start with a full stop:
- .regex [pattern]: any column whose name is matched by [pattern]. The [pattern] should be an unquoted, valid, PERL-flavored regular expression.
- .global: any column (see Section Global dictionary)
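The .regex keyword can be mixed with exact column names in the same dictionary. The following is a minimal sketch rather than an example from the package: the data frame, its column names (lab_result_01, lab_result_02, region), and the dictionary entries are all made up for illustration, and it assumes the matchmaker package is installed.

```r
library(matchmaker)

# Hypothetical data: two lab columns sharing a naming pattern, plus a region
dat_rx <- data.frame(
  lab_result_01 = c("pos", "neg", "neg"),
  lab_result_02 = c("neg", "pos", "pos"),
  region        = c("N", "S", "N"),
  stringsAsFactors = FALSE
)

# Hypothetical dictionary: the `grp` column holds the keys.
# ".regex ^lab_result_" targets every column whose name starts with
# "lab_result_"; "region" targets only the column with that exact name.
dict_rx <- data.frame(
  options = c("pos", "neg", "N", "S"),
  values  = c("Positive", "Negative", "North", "South"),
  grp     = c(".regex ^lab_result_", ".regex ^lab_result_", "region", "region"),
  stringsAsFactors = FALSE
)

res_rx <- match_df(dat_rx,
                   dictionary = dict_rx,
                   from = "options",
                   to = "values",
                   by = "grp")
res_rx
```

A single .regex key keeps the dictionary short when many columns share the same coding scheme, at the cost of having to keep the pattern narrow enough that it does not catch unrelated columns.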
Global dictionary
A global dictionary is a set of definitions applied to all valid columns of x indiscriminately.
- .global keyword in by: If you want to apply a set of definitions to all valid columns in addition to specified columns, then you can include a .global group in the by column of your dictionary data frame. This is useful for setting up a dictionary of common spelling errors. NOTE: specific variable definitions will override global definitions. For example: if you have a column for cardinal directions and a definition for N = North, then the global variable N = no will not override that. See Example.
- by = NULL: If you want your data frame to be applied to all character/factor columns indiscriminately, then setting by = NULL will use that dictionary globally.
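Both ways of defining a global dictionary can be sketched as follows. The data frames, column names (direction, consent, followup), and dictionary entries are invented for illustration and do not come from the package's example files. The sketch assumes the matchmaker package is installed.

```r
library(matchmaker)

# Hypothetical data: "N" means North in one column and no in the other
dat_g <- data.frame(
  direction = c("N", "S", "N"),
  consent   = c("Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# 1. .global keyword: global definitions sit next to column-specific ones.
#    Following the rule described above, the specific definition
#    N = North for `direction` should take precedence over the global N = no.
dict_g <- data.frame(
  options = c("Y", "N", "N", "S"),
  values  = c("yes", "no", "North", "South"),
  grp     = c(".global", ".global", "direction", "direction"),
  stringsAsFactors = FALSE
)

res_g <- match_df(dat_g,
                  dictionary = dict_g,
                  from = "options",
                  to = "values",
                  by = "grp")
res_g

# 2. by = NULL: a two-column dictionary applied to every
#    character/factor column, with no grouping column at all.
dat_all <- data.frame(
  consent  = c("Y", "N", "Y"),
  followup = c("N", "N", "Y"),
  stringsAsFactors = FALSE
)

dict_all <- data.frame(
  options = c("Y", "N"),
  values  = c("yes", "no"),
  stringsAsFactors = FALSE
)

res_all <- match_df(dat_all,
                    dictionary = dict_all,
                    from = "options",
                    to = "values",
                    by = NULL)
res_all
```

The .global keyword suits a dictionary that mixes shared and column-specific definitions, while by = NULL is the simpler choice when every definition applies to every column.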
Value
a data frame with re-defined data based on the dictionary
Author(s)
Zhian N. Kamvar
Patrick Barks
See Also
match_vec(), which this function wraps.
Examples
# Read in dictionary and coded data examples --------------------
dict <- read.csv(matchmaker_example("spelling-dictionary.csv"),
stringsAsFactors = FALSE)
dat <- read.csv(matchmaker_example("coded-data.csv"),
stringsAsFactors = FALSE)
dat$date <- as.Date(dat$date)
# Clean spelling based on dictionary -----------------------------
dict # show the dict
head(dat) # show the data
res1 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp")
head(res1)
# Show warnings/errors from each column --------------------------
# Internally, the `match_vec()` function can be quite noisy with warnings for
# various reasons. Thus, by default, the `match_df()` function will keep
# these quiet, but you can have them printed to your console if you use the
# warn = TRUE option:
res1 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp",
warn = TRUE)
head(res1)
# You can ensure the order of the factor levels is correct by specifying
# a column that defines order.
dat[] <- lapply(dat, as.factor)
as.list(head(dat))
res2 <- match_df(dat,
dictionary = dict,
from = "options",
to = "values",
by = "grp",
order = "orders")
head(res2)
as.list(head(res2))