dcr_app {datacleanr} | R Documentation |
Interactive and reproducible data cleaning
Description
Launches the datacleanr app for interactive and reproducible cleaning.
See Details for more information.
Usage
dcr_app(dframe, browser = TRUE)
Arguments
dframe |
Character, a string naming a |
browser |
logical, should app start in OS's default browser? (default |
Details
datacleanr provides an interactive data overview, and allows reproducible filtering and (manual, interactive) visual outlier detection and annotation across multiple app tabs:
- Overview and Set-up: set groups (see below) and generate an exploratory summary of dframe
- Filtering: provide and apply filter statements (groupwise, see below and filter_scoped_df)
- Visualization and Annotating: interactive visualization allowing outlier highlighting, annotating and before/after histograms of displayed (numeric) variables
- Extraction: generates Reproducible Recipe and outputs
For data sets exceeding 1.5 million rows, we suggest splitting the data, if possible, by a grouping factor.
This is because at this volume interactive visualizations using plotly stretch the limits of what modern web browsers can handle.
A simple example using iris is:
iris_split <- split(iris, iris$Species)
dcr_app(iris_split[[1]])
# or
lapply(iris_split, dcr_app)
Extensive documentation is provided on each of the tabs for individual procedures in help links.
datacleanr relies on 1) generating a column of unique IDs (.dcrkey) and 2) subsetting dframe into sub-groups (generated in-app, added as column .dcrindex) for filtering and visualization.
These groups are composed of unique combinations of columns in the data set (must be factor) and are passed to group_by. They are carried through the app for exploratory analyses (tab Overview and Set-up), filtering (tab Filtering) and plotting (tab Visualization).
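The idea of a row key plus a group index can be sketched in base R. This is an illustration only and not datacleanr's internal code: the use of mtcars, the choice of cyl and gear as grouping columns, and the numbering produced by interaction() are assumptions, and the index values assigned in-app may differ.

```r
# Illustration only: mimic a unique row key and a group index.
df <- mtcars

# Grouping columns must be factors.
df$cyl  <- factor(df$cyl)
df$gear <- factor(df$gear)

# One unique ID per row.
df$.dcrkey <- seq_len(nrow(df))

# One integer per observed combination of the grouping columns
# (drop = TRUE discards combinations that do not occur in the data).
df$.dcrindex <- as.integer(interaction(df$cyl, df$gear, drop = TRUE))

# Number of rows per group
table(df$.dcrindex)
```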
These groups should ideally be chosen to facilitate a convenient filtering and viewing/cleaning process.
For example, a data set with time series of multiple sensors could be grouped by sensor and/or additional columns,
such that periods of interest can be visualized and cleaned simultaneously in the interactive plot.
Filtering is achieved by providing expressions that evaluate to TRUE/FALSE, and can be applied to the entire data set, or individual/all groups via scoped filtering (see filter_scoped_df).
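A rough base R sketch of what scoped filtering means is given below. The helper scoped_filter() is hypothetical and written for this illustration; it is not filter_scoped_df and makes no claim about that function's signature or behavior. It assumes a group index column .dcrindex, and that rows outside the targeted groups are left untouched.

```r
# Hypothetical helper: apply a filter statement to selected groups only.
scoped_filter <- function(df, statement, groups = NULL) {
  # Evaluate the statement with the data frame's columns in scope.
  keep <- eval(parse(text = statement), envir = df, enclos = parent.frame())
  keep <- !is.na(keep) & keep
  if (!is.null(groups)) {
    # Rows in other groups are kept regardless of the statement.
    keep <- keep | !(df$.dcrindex %in% groups)
  }
  df[keep, , drop = FALSE]
}

df <- iris
df$.dcrindex <- as.integer(df$Species)  # assumed group index: 1 = setosa

# Filter the whole data set
all_groups <- scoped_filter(df, "Sepal.Length > 5")

# Filter group 1 only; groups 2 and 3 pass through unchanged
group_one <- scoped_filter(df, "Sepal.Length > 5", groups = 1)
```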
The interactive visualization allows selecting and deselecting points with lasso and box select tools, as well as interactive zooming (toolbar or clicking on legend items or group overview table, see tab in-app) and panning (toolbar and hover over plot's axes). Data formats supported are:
- Observational (numeric), timeseries (POSIXct) and categorical data in x and y dimensions/axes
- Observational (numeric) data in z dimension (point size)
- Spatial data, when lon and lat in decimal degrees are present in x and y.
Displaying spatial data requires a Mapbox account, from which an access token needs to be copied into your .Renviron (e.g. MAPBOX_TOKEN=your_copied_token).
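Whether the token is visible to the current R session can be checked before launching the app. This is a minimal sketch that assumes the variable name MAPBOX_TOKEN from the example above; R must be restarted after editing .Renviron for the change to take effect.

```r
# Read the token from the environment; returns "" if it is not set.
token <- Sys.getenv("MAPBOX_TOKEN")

if (!nzchar(token)) {
  message("MAPBOX_TOKEN is not set; add it to .Renviron and restart R.")
}
```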
Note that when a column .dcrflag (logical, TRUE/FALSE) is present in dframe, respective observations are given contrasting symbols (FALSE = circle, TRUE = star-triangle).
This column is employed as a cross-referencing tool for e.g. other outlier detection or data-processing algorithms that were applied prior.
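As a sketch of how such a column might be prepared beforehand, the example below flags values with a simple 1.5 x IQR rule. The rule, the iris data and the choice of Sepal.Width are assumptions made for illustration; any prior detection method that yields a logical vector would serve the same purpose.

```r
df <- iris

# Quartiles and interquartile range of the variable of interest
q   <- unname(quantile(df$Sepal.Width, c(0.25, 0.75)))
iqr <- q[2] - q[1]

# TRUE for observations outside the 1.5 x IQR fences
df$.dcrflag <- df$Sepal.Width < q[1] - 1.5 * iqr |
  df$Sepal.Width > q[2] + 1.5 * iqr

# Not run here, because it launches the interactive app:
# dcr_app(df)
```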
The tab Extraction provides code to reproduce the entire procedure (a Reproducible Recipe), which
- can be copied, or sent directly to an active RStudio script when used interactively (i.e. when dframe is an object in R's environment),
- can be saved to disk with intermediate outputs (filter statements and selected outliers), where file names are based on the input file and configurable suffixes when dframe is a path.
Value
When datacleanr is ended by clicking on Close in the app's navigation bar, a list is invisibly returned with the following items:
- df_name: character, object name/file path passed into dcr_app
- dcr_df: tibble, filtered data set with additional columns .dcrkey, .dcrindex, .annotation; the latter is NA for non-outliers, an empty string for outliers without annotation, and a custom string for annotated outliers
- dcr_selected_outliers: data.frame, contains the outlier .dcrkey, the .annotation and a selection_count (integer, count incrementer) column
- dcr_groups: character, a vector defining the groups (via group_by) used throughout datacleanr
- dcr_condition_df: tibble, with columns filter (character, statement used for filtering) and group (list, of integers), defining groups that correspond to .dcrindex
- dcr_code: character string, containing Reproducible Recipe
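Because the list is returned invisibly, it has to be assigned to be used afterwards (e.g. out <- dcr_app(iris)). The sketch below works on a small mock that imitates only two of the documented items. The mock values and the value column are invented for illustration, and a plain data.frame stands in for the tibble.

```r
# Mock of part of the documented return structure.
# In practice: out <- dcr_app(iris)
out <- list(
  df_name = "iris",
  dcr_df = data.frame(
    value       = c(1.2, 9.9, 1.4, 8.7),
    .dcrkey     = 1:4,
    .dcrindex   = c(1L, 1L, 2L, 2L),
    .annotation = c(NA, "", NA, "sensor drift"),
    stringsAsFactors = FALSE
  )
)

# Outliers are the rows whose .annotation is not NA.
is_outlier <- !is.na(out$dcr_df$.annotation)

# Data without the selected outliers
cleaned <- out$dcr_df[!is_outlier, ]

# Outliers that carry a custom annotation string
annotated <- out$dcr_df[is_outlier & nzchar(out$dcr_df$.annotation), ]
```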