apply_quality_ctrl {eHDPrep} | R Documentation |
Apply quality control measures to a dataset
Description
The primary high-level function for quality control. Applies several quality control functions in sequence to the input data frame (see Details for the individual functions).
Usage
apply_quality_ctrl(
data,
id_var,
class_tbl,
bin_cats = NULL,
min_freq = 1,
to_numeric_matrix = FALSE
)
Arguments
data |
A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). |
id_var |
An unquoted expression which corresponds to a variable (column) in
|
class_tbl |
data frame such as the output tibble from
|
bin_cats |
Optional named vector of user-defined values for binary
values using |
min_freq |
Minimum frequency of occurrence
|
to_numeric_matrix |
Should QC'ed data be converted to a numeric matrix? Default: FALSE. |
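Since class_tbl determines which variables each step is applied to, it can be useful to confirm that it lines up with the columns of data before calling the function. The following is a minimal base-R sketch, not part of eHDPrep: check_class_tbl is a hypothetical helper, the toy objects are invented for illustration, and the column names var and datatype are assumed to match the layout used in the Examples section.

```r
# Hypothetical helper (not an eHDPrep function): compare the columns of a
# data frame with the variables listed in a class table.
check_class_tbl <- function(data, class_tbl) {
  list(
    # columns present in the data but absent from the class table
    unclassified = setdiff(names(data), class_tbl$var),
    # variables listed in the class table but absent from the data
    not_in_data = setdiff(class_tbl$var, names(data))
  )
}

# Toy inputs, invented for illustration
toy <- data.frame(
  patient_id = 1:2,
  tumoursize = c(1.2, 3.4),
  t_stage = c("T1", "T2"),
  stringsAsFactors = FALSE
)
toy_classes <- data.frame(
  var = c("patient_id", "tumoursize", "n_stage"),
  datatype = c("id", "numeric", "ordinal_nstage"),
  stringsAsFactors = FALSE
)

res <- check_class_tbl(toy, toy_classes)
print(res)
```

Both elements of the returned list being empty would suggest that every column has exactly one candidate entry in the class table.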
Details
The wrapped functions are applied in the following order:
1. Standardise missing values (strings_to_NA)
2. Encode binary categorical variables (columns) (encode_binary_cats)
3. Encode (specific) ordinal variables (columns) (encode_ordinals)
4. Encode genotype variables (encode_genotypes)
5. Extract information from free text variables (columns) (extract_freetext)
6. Encode non-binary categorical variables (columns) (encode_cats)
7. Encode output as numeric matrix (optional, encode_as_num_mat)
class_tbl is used to apply the above functions to the appropriate variables (columns).
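The idea of running steps in a fixed order, with a class table deciding which columns each step touches, can be sketched in base R as follows. This is an illustration only and does not reproduce eHDPrep's internals: step_missing, step_numeric and vars_of_type are hypothetical stand-ins, and the toy data, class table and missing-value strings are invented.

```r
# Toy data and class table (hypothetical, not the package's example_data)
toy <- data.frame(
  id = c("a", "b", "c"),
  smoker = c("Yes", "No", "unknown"),
  size = c("1.5", "missing", "3"),
  stringsAsFactors = FALSE
)
toy_classes <- data.frame(
  var = c("id", "smoker", "size"),
  datatype = c("id", "factor", "numeric"),
  stringsAsFactors = FALSE
)

# Names of the variables recorded with a given datatype
vars_of_type <- function(class_tbl, type) {
  class_tbl$var[class_tbl$datatype == type]
}

# Stand-in step 1: replace chosen strings with NA in every column
step_missing <- function(df, na_strings = c("unknown", "missing")) {
  df[] <- lapply(df, function(col) {
    col[col %in% na_strings] <- NA
    col
  })
  df
}

# Stand-in step 2: convert only the columns classed as "numeric"
step_numeric <- function(df, class_tbl) {
  for (v in vars_of_type(class_tbl, "numeric")) {
    df[[v]] <- as.numeric(df[[v]])
  }
  df
}

# Apply the steps in order; each step receives the previous step's result
steps <- list(
  step_missing,
  function(df) step_numeric(df, toy_classes)
)
toy_qc <- Reduce(function(df, step) step(df), steps, toy)
str(toy_qc)
```

Order matters in this sketch: missing-value strings are standardised first, so the later numeric conversion sees NA rather than text it cannot parse.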
Value
data with several QC measures applied.
See Also
Other high level functionality: assess_quality(), review_quality_ctrl(), semantic_enrichment()
Examples
data(example_data)
require(tibble)
# create an example class_tbl object
# note that diabetes_type is classed as ordinal and is not modified as its
# levels are not pre-coded
data_types <- tibble::tribble(
  ~"var",           ~"datatype",
  "patient_id",     "id",
  "tumoursize",     "numeric",
  "t_stage",        "ordinal_tstage",
  "n_stage",        "ordinal_nstage",
  "diabetes",       "factor",
  "diabetes_type",  "ordinal",
  "hypertension",   "factor",
  "rural_urban",    "factor",
  "marital_status", "factor",
  "SNP_a",          "genotype",
  "SNP_b",          "genotype",
  "free_text",      "freetext"
)
data_QC <- apply_quality_ctrl(example_data, patient_id, data_types,
                              bin_cats = c("No" = "Yes", "rural" = "urban"),
                              min_freq = 0.6)
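The bin_cats value passed above is an ordinary named character vector, so each element carries two labels: its name and its value. The base-R sketch below shows how such pairs can be unpacked and used to code a binary variable. The helper encode_pair is hypothetical, and the coding it uses (name coded as 1, value coded as 2) is an assumption made for illustration; see encode_binary_cats for the package's actual behaviour.

```r
bin_cats <- c("No" = "Yes", "rural" = "urban")

# Unpack the named vector into one row per binary variable
label_pairs <- data.frame(
  first = names(bin_cats),
  second = unname(bin_cats),
  stringsAsFactors = FALSE
)
print(label_pairs)

# Hypothetical helper: code `first` as 1 and `second` as 2 (assumed coding);
# any other value, including NA, becomes NA
encode_pair <- function(x, first, second) {
  as.integer(factor(x, levels = c(first, second)))
}

encoded <- encode_pair(c("Yes", "No", NA, "Yes"),
                       label_pairs$first[1], label_pairs$second[1])
print(encoded)
```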