match_vec {matchmaker} | R Documentation |
Rename values in a vector based on a dictionary
Description
This function provides an interface for forcats::fct_recode(), forcats::fct_explicit_na(), and forcats::fct_relevel() in such a way that a data dictionary can be imported from a data frame.
Usage
match_vec(
x = character(),
dictionary = data.frame(),
from = 1,
to = 2,
quiet = FALSE,
warn_default = TRUE,
anchor_regex = TRUE
)
Arguments
x | a character or factor vector
dictionary | a matrix or data frame defining mis-spelled words or keys in one column (from) and replacement values in another column (to)
from | a column name or position defining words or keys to be replaced
to | a column name or position defining replacement values
quiet | a logical
warn_default | a logical
anchor_regex | a logical
Details
Keys (from column)
The from column of the dictionary will contain the keys that you want to match in your current data set. These are expected to match exactly, with the exception of three reserved keywords that start with a full stop:
- .regex [pattern]: will replace anything matching [pattern]. This is executed before any other replacements are made. The [pattern] should be an unquoted, valid, PERL-flavored regular expression. Any whitespace padding the regular expression is discarded.
- .missing: replaces any missing values (see Note)
- .default: replaces ALL values that are not defined in the dictionary and are not missing.
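The exact, .missing and .default keys described above can be sketched in base R. This is only an illustration of the documented behaviour, not matchmaker's implementation: recode_keys() is a hypothetical helper, it ignores .regex keys, and it assumes that "defined in the dictionary" means "present in the from column".

```r
# Hypothetical helper illustrating exact, .missing and .default keys.
# It does not handle .regex keys and is not matchmaker's own code.
recode_keys <- function(x, from, to) {
  out <- x
  exact <- !from %in% c(".missing", ".default")
  # exact keys: look each value up among the ordinary keys
  idx <- match(x, from[exact])
  found <- !is.na(idx)
  out[found] <- to[exact][idx[found]]
  # .default: every non-missing value that has no key of its own
  if (".default" %in% from) {
    out[!found & !is.na(x)] <- to[from == ".default"]
  }
  # .missing: every NA in the input
  if (".missing" %in% from) {
    out[is.na(x)] <- to[from == ".missing"]
  }
  out
}

from <- c("foubar", "fubar", ".missing", ".default")
to   <- c("foobar", "foobar", "missing", "other")
result <- recode_keys(c("foubar", "fubar", NA, "xyz"), from, to)
result
```

Here the two misspellings are matched exactly, the NA is picked up by .missing, and "xyz" falls through to .default.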
Values (to column)
The values will replace their respective keys exactly as they are presented.
There is currently one recognised keyword that can be placed in the to column of your dictionary:
- .na: Replace keys with missing data. When used in combination with the .missing keyword (in column 1), it can allow you to differentiate between explicit and implicit missing data.
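As a rough illustration of the distinction that .na allows, the base R sketch below mimics a dictionary that maps "unknown" to .na and .missing to "missing". It is a stand-in written for this page under those assumptions, not a call into matchmaker.

```r
# Base R stand-in for a dictionary with two rows:
#   from = "unknown",  to = ".na"      (the key becomes missing data)
#   from = ".missing", to = "missing"  (NA in the data gets a label)
x <- c("yes", "unknown", NA, "no")

was_na <- is.na(x)                    # values that were missing in the input
out <- x
out[!was_na & x == "unknown"] <- NA   # the .na value
out[was_na] <- "missing"              # the .missing key
out
```

After this step, NA marks answers that were recorded as "unknown", while "missing" marks values that were absent from the input.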
Value
a vector of the same type as x with mis-spelled labels cleaned.
Note that factors will be arranged by the order presented in the data
dictionary; other levels will appear afterwards.
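The level ordering described here can be sketched in base R as follows. The variable names are illustrative only, and the package itself is documented as going through forcats::fct_relevel() rather than this code.

```r
# A factor whose levels start out in alphabetical order
x <- factor(c("b", "foobar", "a", "unknown"))

# Replacement values in the order they appear in the dictionary's to column
# (duplicates are collapsed so that each level is listed once)
dictionary_values <- unique(c("foobar", "foobar", "unknown"))

# Dictionary values first, all remaining levels afterwards
new_levels <- c(dictionary_values, setdiff(levels(x), dictionary_values))
y <- factor(x, levels = new_levels)
levels(y)
```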
Note
If there are any missing values in the from column (keys), then they are automatically converted to the character "NA" with a warning. If you want to target missing data with your dictionary, use the .missing keyword. The .regex keyword uses gsub() with the perl = TRUE option for replacement.
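To make the gsub() behaviour concrete, the sketch below strips the .regex keyword and its padding from a key and applies the remaining pattern with perl = TRUE. The parsing step is an assumption about how a key might be split, not the package's code. Note that gsub() replaces only the matched portion of a string, so the second call anchors the pattern by hand to replace whole values only.

```r
key <- ".regex   f[ou][^m].+?r$  "

# Drop the keyword and any whitespace padding the pattern
pattern <- trimws(sub("^\\.regex", "", key))

x <- c("foubar", "foobr", "fubar", "xfubar", "format")

# Unanchored at the start: the match inside "xfubar" is replaced too
unanchored <- gsub(pattern, "foobar", x, perl = TRUE)

# Anchored at the start by hand: only whole-string matches are replaced
anchored <- gsub(paste0("^", pattern), "foobar", x, perl = TRUE)

unanchored
anchored
```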
Author(s)
Zhian N. Kamvar
See Also
match_df() for an implementation that acts across multiple variables in a data frame.
Examples
corrections <- data.frame(
  bad = c("foubar", "foobr", "fubar", "unknown", ".missing"),
  good = c("foobar", "foobar", "foobar", ".na", "missing"),
  stringsAsFactors = FALSE
)
corrections
# create some fake data
my_data <- c(letters[1:5], sample(corrections$bad[-5], 10, replace = TRUE))
my_data[sample(6:15, 2)] <- NA # with missing elements
match_vec(my_data, corrections)
# You can use regular expressions to simplify your list
corrections <- data.frame(
  bad = c(".regex f[ou][^m].+?r$", "unknown", ".missing"),
  good = c("foobar", ".na", "missing"),
  stringsAsFactors = FALSE
)
# You can also set a default value
corrections_with_default <- rbind(corrections, c(bad = ".default", good = "unknown"))
corrections_with_default
# a warning will be issued about the data that were converted
match_vec(my_data, corrections_with_default)
# use warn_default = FALSE if you are absolutely sure you don't want the warning.
match_vec(my_data, corrections_with_default, warn_default = FALSE)
# The function will give you a warning if the dictionary does not
# match the data
match_vec(letters, corrections)
# This can be used for translating survey output
words <- data.frame(
  option_code = c(".regex ^[yY][eE]?[sS]?",
                  ".regex ^[nN][oO]?",
                  ".regex ^[uU][nN]?[kK]?",
                  ".missing"),
  option_name = c("Yes", "No", ".na", "Missing"),
  stringsAsFactors = FALSE
)
match_vec(c("Y", "Y", NA, "No", "U", "UNK", "N"), words)