clean {cleaner}    R Documentation

Clean column data to a class

Description

Use any of these functions to quickly clean columns in your data set. Use clean() to let the package pick, for each input, the cleaning function that returns the lowest proportion of NAs. The clean_* functions always return the class from the function name (e.g. clean_Date() always returns class Date).

Usage

clean(x)

## S3 method for class 'data.frame'
clean(x)

clean_logical(
  x,
  true = regex_true(),
  false = regex_false(),
  na = NULL,
  fixed = FALSE,
  ignore.case = TRUE
)

clean_factor(
  x,
  levels = unique(x),
  ordered = FALSE,
  droplevels = FALSE,
  fixed = FALSE,
  ignore.case = TRUE
)

clean_numeric(x, remove = "[^0-9.,-]", fixed = FALSE)

clean_double(x, remove = "[^0-9.,-]", fixed = FALSE)

clean_integer(x, remove = "[^0-9.,-]", fixed = FALSE)

clean_character(
  x,
  remove = "[^a-z \t\r\n]",
  fixed = FALSE,
  ignore.case = TRUE,
  trim = TRUE
)

clean_currency(x, currency_symbol = NULL, remove = "[^0-9.,-]", fixed = FALSE)

clean_percentage(x, remove = "[^0-9.,-]", fixed = FALSE)

clean_Date(x, format = NULL, guess_each = FALSE, max_date = Sys.Date(), ...)

clean_POSIXct(
  x,
  tz = "",
  remove = "[^.0-9 :/-]",
  fixed = FALSE,
  max_date = Sys.Date(),
  ...
)

Arguments

x

data to clean

true

regex to interpret values as TRUE (which defaults to regex_true), see Details

false

regex to interpret values as FALSE (which defaults to regex_false), see Details

na

regex to force values to be interpreted as NA, i.e. not as TRUE or FALSE

fixed

logical to indicate whether regular expressions should be turned off

ignore.case

logical to indicate whether matching should be case-insensitive

levels

new factor levels, may be named with regular expressions to match existing values, see Details

ordered

logical to indicate whether the factor levels should be ordered

droplevels

logical to indicate whether factor levels that do not occur in the data should be dropped

remove

regex to define the character(s) that should be removed, see Details

trim

logical to indicate whether the result should be trimmed with trimws(..., which = "both")

currency_symbol

the currency symbol to use, which will be guessed based on the input and otherwise defaults to the current system locale setting (see Sys.localeconv)

format

character string giving a date-time format as used by strptime.

For clean_Date(..., guess_each = TRUE), this can be a vector of values to be used for guessing, see Examples.

guess_each

logical to indicate whether the format of each item of x should be guessed one by one, see Examples

max_date

date (coercible with as.Date()) to indicate the maximum allowed value of x, which defaults to today. This prevents clean_Date("23-03-47") from returning 23 March 2047; it returns 23 March 1947 instead, with a warning.

...

for clean_Date and clean_POSIXct: other parameters passed on to these functions

tz

time zone specification to be used for the conversion, if one is required. System-specific (see time zones), but "" is the current time zone, and "GMT" is UTC (Universal Time, Coordinated). Invalid values are most commonly treated as UTC, on some platforms with a warning.
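The century rollback described for max_date can be illustrated in base R. The following is a minimal sketch and not the package's implementation: roll_back_century() is a hypothetical helper, and the fixed format "%d-%m-%y" is an assumption made for the illustration. It relies on the documented base R behaviour that two-digit years 00 to 68 are read as 20xx.

```r
# Hypothetical helper (not part of cleaner): parse two-digit years and move
# any date later than `max_date` back by one century, with a warning.
roll_back_century <- function(x, format = "%d-%m-%y", max_date = Sys.Date()) {
  d <- as.Date(x, format = format)
  too_late <- !is.na(d) & d > as.Date(max_date)
  if (any(too_late)) {
    lt <- as.POSIXlt(d)
    lt$year[too_late] <- lt$year[too_late] - 100  # shift back 100 years
    d <- as.Date(lt)
    warning(sum(too_late), " date(s) after max_date were moved back 100 years")
  }
  d
}

# "47" is first read as 2047, which lies after max_date, so it becomes 1947
roll_back_century("23-03-47", max_date = as.Date("2020-01-01"))
```

Dates that already fall on or before max_date are returned unchanged and without a warning.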

Details

Using clean() on a vector will guess a cleaning function based on the potential number of NAs it returns. Using clean() on a data.frame applies this guessed cleaning to every column.
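The idea of choosing a cleaner by its share of NAs can be sketched in base R. This is only an illustration under stated assumptions, not how clean() is implemented: the candidate converters, na_share() and guess_cleaner() are hypothetical names, and the real function may consider more classes and use different tie-breaking rules.

```r
# Hypothetical candidates: plain base R converters standing in for clean_*()
candidates <- list(
  numeric = function(x) suppressWarnings(as.numeric(x)),
  logical = function(x) as.logical(x),
  Date    = function(x) as.Date(x, format = "%Y-%m-%d")
)

# Proportion of NAs a converter produces for x
na_share <- function(f, x) mean(is.na(f(x)))

# Return the name of the candidate with the lowest proportion of NAs
guess_cleaner <- function(x) {
  shares <- vapply(candidates, na_share, numeric(1), x = x)
  names(shares)[which.min(shares)]
}

guess_cleaner(c("12", "7.5", "n/a"))
guess_cleaner(c("2020-01-05", "2021-12-31"))
```

For a data.frame, the same guess could be applied column by column, for example with lapply().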

Info about the different functions:

Unlike base R, the functions above do not return an error when given an invalid regular expression. Instead, they interpret the expression as a fixed value and throw a warning.
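The fallback from an invalid regular expression to a fixed string can be sketched with tryCatch() in base R. This is an assumption about how such behaviour could be built, not the package's own code; safe_grepl() is a hypothetical name.

```r
# Hypothetical helper: try `pattern` as a regular expression first; if base R
# rejects it, warn and match it as a fixed string instead.
safe_grepl <- function(pattern, x) {
  tryCatch(
    grepl(pattern, x),
    error = function(e) {
      warning("invalid regular expression '", pattern,
              "', interpreting it as a fixed value")
      grepl(pattern, x, fixed = TRUE)
    }
  )
}

# "(" is not a valid regular expression, so it is matched literally
safe_grepl("(", c("a(b", "ab"))
```

A valid pattern takes the first branch, so it behaves exactly like a plain grepl() call.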

Value

The clean_* functions always return the class from the function name, e.g. clean_Date() always returns class Date.

Source

Triennial Central Bank Survey: Foreign exchange turnover in April 2016 (PDF). Bank for International Settlements. 11 December 2016. p. 10.

Examples

clean_logical(c("Yes", "No"))   # English
clean_logical(c("Oui", "Non"))  # French
clean_logical(c("ya", "tidak")) # Indonesian
clean_logical(x = c("Positive", "Negative", "Unknown", "Some value"),
              true = "pos", false = "neg")

gender_age <- c("male 0-50", "male 50+", "female 0-50", "female 50+")
clean_factor(gender_age, c("M", "F"))
clean_factor(gender_age, c("Male", "Female"))
clean_factor(gender_age, c("0-50", "50+"), ordered = TRUE)

clean_Date("13jul18", "ddmmmyy")
clean_Date("12 August 2010")
clean_Date("12 06 2012")
clean_Date("October 1st 2012")
clean_Date("43658")
clean_Date("14526", "Excel")
clean_Date(c("1 Oct 13", "October 1st 2012")) # cannot be fitted to a single format
clean_Date(c("1 Oct 13", "October 1st 2012"), guess_each = TRUE)
clean_Date(c("12-14-13", "1 Oct 2012"), 
           guess_each = TRUE,
           format = c("d mmm yyyy", "mm-yy-dd")) # only these formats will be tried

clean_POSIXct("Created log on 2020/02/11 11:23 by user Joe")
clean_POSIXct("Created log on 2020.02.11 11:23 by user Joe", tz = "UTC")

clean_numeric("qwerty123456")
clean_numeric("Positive (0.143)")
clean_numeric("0,143")
clean_numeric("minus 12 degrees")

clean_percentage("PCT: 0.143")
clean_percentage(c("Total of -12.3%", "Total of +4.5%"))

clean_character("qwerty123456")
clean_character("Positive (0.143)")

clean_currency(c("Received 25", "Received 31.40"))
clean_currency(c("Jack sent £ 25", "Bill sent £ 31.40"))

df <- data.frame(A = c("2 Apr 2016", "5 Feb 2020"), 
                 B = c("yes", "no"),
                 C = c("Total of -12.3%", "Total of +4.5%"),
                 D = c("Marker: 0.4513 mmol/l", "Marker: 0.2732 mmol/l"))
df
clean(df)

[Package cleaner version 1.5.4 Index]