clean {cleaner} | R Documentation |
Clean column data to a class
Description
Use any of these functions to quickly clean columns in your data set. Use clean() to pick the function that returns the lowest relative number of NAs. The clean_*() functions always return the class from their function name (e.g. clean_Date() always returns class Date).
Usage
clean(x)
## S3 method for class 'data.frame'
clean(x)
clean_logical(
  x,
  true = regex_true(),
  false = regex_false(),
  na = NULL,
  fixed = FALSE,
  ignore.case = TRUE
)
clean_factor(
  x,
  levels = unique(x),
  ordered = FALSE,
  droplevels = FALSE,
  fixed = FALSE,
  ignore.case = TRUE
)
clean_numeric(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_double(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_integer(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_character(
  x,
  remove = "[^a-z \t\r\n]",
  fixed = FALSE,
  ignore.case = TRUE,
  trim = TRUE
)
clean_currency(x, currency_symbol = NULL, remove = "[^0-9.,-]", fixed = FALSE)
clean_percentage(x, remove = "[^0-9.,-]", fixed = FALSE)
clean_Date(x, format = NULL, guess_each = FALSE, max_date = Sys.Date(), ...)
clean_POSIXct(
  x,
  tz = "",
  remove = "[^.0-9 :/-]",
  fixed = FALSE,
  max_date = Sys.Date(),
  ...
)
Arguments
x | data to clean |
true | regex to interpret values as TRUE |
false | regex to interpret values as FALSE |
na | regex to force values to be interpreted as NA |
fixed | logical to indicate whether regular expressions should be turned off |
ignore.case | logical to indicate whether matching should be case-insensitive |
levels | new factor levels, may be named with regular expressions to match existing values, see Details |
ordered | logical to indicate whether the factor levels should be ordered |
droplevels | logical to indicate whether non-existing factor levels should be dropped |
remove | regex to define the character(s) that should be removed, see Details |
trim | logical to indicate whether the result should be trimmed with |
currency_symbol | the currency symbol to use, which will be guessed based on the input and otherwise defaults to the current system locale setting (see |
format | character string giving a date-time format as used by strptime. For |
guess_each | logical to indicate whether all items of |
max_date | date (coercible with as.Date()) to indicate the maximum allowed of |
... | for |
tz | time zone specification to be used for the conversion, if one is required. System-specific (see time zones), but |
Details
Using clean() on a vector will guess a cleaning function based on the potential number of NAs it returns. Using clean() on a data.frame applies this guessed cleaning to each column.
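The guessing idea can be illustrated in base R. The following is only a minimal sketch under stated assumptions: the three candidate cleaners and the helper guess_cleaner() are hypothetical stand-ins written for this illustration, not the implementation used by the cleaner package.

```r
# Hypothetical stand-in cleaners in base R (assumptions for illustration only;
# they are far simpler than the clean_*() functions documented here).
candidates <- list(
  logical = function(x) {
    out <- rep(NA, length(x))
    out[grepl("^(yes|true|y)$", x, ignore.case = TRUE)] <- TRUE
    out[grepl("^(no|false|n)$", x, ignore.case = TRUE)] <- FALSE
    out
  },
  numeric = function(x) suppressWarnings(as.numeric(gsub("[^0-9.-]", "", x))),
  Date = function(x) as.Date(x, format = "%Y-%m-%d")
)

# Apply every candidate and keep the name of the one with the lowest share of NAs.
guess_cleaner <- function(x) {
  na_share <- vapply(candidates, function(f) mean(is.na(f(x))), numeric(1))
  names(which.min(na_share))
}

guess_cleaner(c("yes", "no", "Yes"))         # "logical"
guess_cleaner(c("12.5 kg", "7 kg"))          # "numeric"
guess_cleaner(c("2020-01-05", "2021-12-31")) # "Date"
```

Ties are resolved by which.min(), which returns the first candidate with the minimum; how clean() itself breaks ties is not stated in this documentation.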
Info about the different functions:
clean_logical():
Use parameters true and false to match values using case-insensitive regular expressions (regex). Unmatched values are considered NA. By default, values are matched with regex_true and regex_false. This allows support for values "Yes" and "No" in the following languages: Arabic, Bengali, Chinese (Mandarin), Dutch, English, French, German, Hindi, Indonesian, Japanese, Malay, Portuguese, Russian, Spanish, Telugu, Turkish and Urdu. Use parameter na to override values as NA that would otherwise be matched with true or false. See Examples.
clean_factor():
Use parameter levels to set new factor levels. They can be case-insensitive regular expressions to match existing values of x. For matching, new values for levels are internally and temporarily sorted in descending order of text length. See Examples.
clean_numeric(), clean_double(), clean_integer() and clean_character():
Use parameter remove to match values that must be removed from the input, using regular expressions (regex). In the case of clean_numeric(), commas will be read as dots and only the last dot will be kept. Function clean_character() will keep middle spaces by default. See Examples.
clean_percentage():
This new class works like clean_numeric(), but transforms it with as.percentage, which will retain the original values, but will print them as percentages. See Examples.
clean_currency():
This new class works like clean_numeric(), but transforms it with as.currency. The currency symbol is guessed based on the most traded currencies by value (see Source): the United States dollar, Euro, Japanese yen, Pound sterling, Swiss franc, Renminbi, Swedish krona, Mexican peso, South Korean won, Turkish lira, Russian ruble, Indian rupee and the South African rand. See Examples.
clean_Date():
Use parameter format to define a date format, or leave it empty to have the format guessed. Use "Excel" to read values as Microsoft Excel dates. The format parameter will be evaluated with format_datetime, which means that a format like "d-mmm-yy" will be translated internally to "%e-%b-%y" for convenience. See Examples.
clean_POSIXct():
Use parameter remove to match values that must be removed from the input, using regular expressions (regex). The resulting string will be coerced to a date/time element with class POSIXct, using as.POSIXct(). See Examples.
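The number handling described for clean_numeric() (commas read as dots, only the last dot kept) can be approximated in base R. This is a rough sketch, not the package's code: sketch_numeric() is a hypothetical helper, and it simply reuses the default remove pattern shown under Usage.

```r
# Hypothetical helper (not part of cleaner): approximate the described parsing steps.
sketch_numeric <- function(x, remove = "[^0-9.,-]") {
  x <- gsub(remove, "", x)               # drop everything except digits, dots, commas and minus signs
  x <- gsub(",", ".", x, fixed = TRUE)   # read commas as dots
  # delete every dot that is followed by another dot, so only the last one remains
  x <- gsub("\\.(?=.*\\.)", "", x, perl = TRUE)
  suppressWarnings(as.numeric(x))
}

sketch_numeric("0,143")            # 0.143
sketch_numeric("Positive (0.143)") # 0.143
sketch_numeric("1.234,56")         # 1234.56
```

The lookahead needs perl = TRUE; with the default regex engine the pattern would not be valid.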
Using an invalid regular expression in any of the above functions will not throw an error, as it would in base R. Instead, the expression is interpreted as a fixed value and a warning is thrown.
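One way such a fallback could look in base R is sketched below. The helper safe_grepl() is hypothetical and only illustrates the behaviour described above; the package's internal approach may differ.

```r
# Hypothetical helper (not part of cleaner): try the pattern as a regex first,
# and fall back to a fixed (literal) match with a warning if it is invalid.
safe_grepl <- function(pattern, x, ignore.case = TRUE) {
  tryCatch(
    # suppressWarnings() hides the extra compilation warning base R emits before the error
    suppressWarnings(grepl(pattern, x, ignore.case = ignore.case)),
    error = function(e) {
      warning("invalid regular expression '", pattern,
              "', interpreting it as a fixed value", call. = FALSE)
      grepl(pattern, x, fixed = TRUE)
    }
  )
}

safe_grepl("^pos", c("Positive", "Negative")) # TRUE FALSE
safe_grepl("(", c("a(b", "ab"))               # TRUE FALSE, with a warning
```

Note that the fixed fallback in this sketch is case-sensitive, because base R ignores ignore.case when fixed = TRUE.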
Value
The clean_* functions always return the class from the function name:
clean_logical(): class logical
clean_factor(): class factor
clean_numeric() and clean_double(): class numeric
clean_integer(): class integer
clean_character(): class character
clean_percentage(): class percentage
clean_currency(): class currency
clean_Date(): class Date
clean_POSIXct(): classes POSIXct/POSIXt
Source
Triennial Central Bank Survey Foreign exchange turnover in April 2016 (PDF). Bank for International Settlements. 11 December 2016. p. 10.
Examples
clean_logical(c("Yes", "No")) # English
clean_logical(c("Oui", "Non")) # French
clean_logical(c("ya", "tidak")) # Indonesian
clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"),
              true = "pos", false = "neg")
gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+")
clean_factor(gender_age, c("M", "F"))
clean_factor(gender_age, c("Male", "Female"))
clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE)
clean_Date("13jul18", "ddmmmyy")
clean_Date("12 August 2010")
clean_Date("12 06 2012")
clean_Date("October 1st 2012")
clean_Date("43658")
clean_Date("14526", "Excel")
clean_Date(c("1 Oct 13", "October 1st 2012")) # could not be fitted in 1 format
clean_Date(c("1 Oct 13", "October 1st 2012"), guess_each = TRUE)
clean_Date(c("12-14-13", "1 Oct 2012"),
           guess_each = TRUE,
           format = c("d mmm yyyy", "mm-yy-dd")) # only these formats will be tried
clean_POSIXct("Created log on 2020/02/11 11:23 by user Joe")
clean_POSIXct("Created log on 2020.02.11 11:23 by user Joe", tz = "UTC")
clean_numeric("qwerty123456")
clean_numeric("Positive (0.143)")
clean_numeric("0,143")
clean_numeric("minus 12 degrees")
clean_percentage("PCT: 0.143")
clean_percentage(c("Total of -12.3%", "Total of +4.5%"))
clean_character("qwerty123456")
clean_character("Positive (0.143)")
clean_currency(c("Received 25", "Received 31.40"))
clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40"))
df <- data.frame(A = c("2 Apr 2016", "5 Feb 2020"),
                 B = c("yes", "no"),
                 C = c("Total of -12.3%", "Total of +4.5%"),
                 D = c("Marker: 0.4513 mmol/l", "Marker: 0.2732 mmol/l"))
df
clean(df)