apply_quality_ctrl {eHDPrep} R Documentation

Apply quality control measures to a dataset

Description

The primary high-level function for quality control. Applies several quality control functions in sequence to an input data frame (see Details for the individual functions).

Usage

apply_quality_ctrl(
  data,
  id_var,
  class_tbl,
  bin_cats = NULL,
  min_freq = 1,
  to_numeric_matrix = FALSE
)

Arguments

data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr).

id_var

An unquoted expression which corresponds to a variable (column) in data which identifies each row (sample).

class_tbl

A data frame, such as the output tibble from assume_var_classes followed by import_var_classes.

bin_cats

Optional named vector of user-defined labels for binary variables, using binary_label_1 = binary_label_2 syntax (e.g. c("No" = "Yes") would assign level 1 to "No" and level 2 to "Yes"). See encode_binary_cats for defaults. Applied to variables (columns) labelled "character" or "factor" in class_tbl.
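
As a rough illustration of the binary_label_1 = binary_label_2 syntax, the base R sketch below mimics the level ordering described above. It does not call encode_binary_cats and is not the package's implementation; the helper sketch_binary_levels and the sample vector answers are hypothetical.

```r
# Hypothetical helper (not part of eHDPrep): derive the level order implied
# by a bin_cats-style named vector, where each name is level 1 and each
# value is level 2.
sketch_binary_levels <- function(x, bin_cats) {
  for (i in seq_along(bin_cats)) {
    pair <- c(names(bin_cats)[i], unname(bin_cats[i]))
    # Use the first pair whose labels cover every non-missing value in x
    if (all(x[!is.na(x)] %in% pair)) {
      return(factor(x, levels = pair))
    }
  }
  x  # no matching pair: leave the input unchanged
}

bin_cats <- c("No" = "Yes", "rural" = "urban")
answers <- c("Yes", "No", NA, "Yes")  # made-up example values
encoded <- sketch_binary_levels(answers, bin_cats)

levels(encoded)      # "No" is level 1, "Yes" is level 2
as.integer(encoded)  # integer codes; missing values stay NA
```

Here the name of each element ("No", "rural") becomes the first level and its value ("Yes", "urban") the second, which is the ordering the argument description states.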

min_freq

Minimum frequency of occurrence that extract_freetext will use when extracting groups of proximal words from free text in variables (columns) labelled "freetext" in class_tbl.

to_numeric_matrix

Should QC'ed data be converted to a numeric matrix? Default: FALSE.

Details

The wrapped functions are applied in the following order:

  1. Standardise missing values (strings_to_NA)

  2. Encode binary categorical variables (columns) (encode_binary_cats)

  3. Encode (specific) ordinal variables (columns) (encode_ordinals)

  4. Encode genotype variables (encode_genotypes)

  5. Extract information from free text variables (columns) (extract_freetext)

  6. Encode non-binary categorical variables (columns) (encode_cats)

  7. Encode output as numeric matrix (optional, encode_as_num_mat)

class_tbl is used to apply the above functions to the appropriate variables (columns).
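
To make the role of class_tbl concrete, the following base R sketch shows one way a two-column class table could be used to look up which variables carry a given datatype label. It is only an illustration under assumptions: the helper vars_of_class is hypothetical, the table is made up, and the package's internal selection logic may differ.

```r
# Made-up class table with the same two columns used in the Examples section
class_tbl <- data.frame(
  var = c("patient_id", "tumoursize", "diabetes", "SNP_a", "free_text"),
  datatype = c("id", "numeric", "factor", "genotype", "freetext"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of eHDPrep): names of the variables that
# carry any of the given datatype labels
vars_of_class <- function(class_tbl, datatypes) {
  class_tbl$var[class_tbl$datatype %in% datatypes]
}

# Variables that each kind of step would be pointed at in this sketch
binary_candidates <- vars_of_class(class_tbl, c("character", "factor"))
genotype_vars <- vars_of_class(class_tbl, "genotype")
freetext_vars <- vars_of_class(class_tbl, "freetext")
```

Variables whose label matches none of the requested datatypes (for example "numeric" here) are simply not selected for that step.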

Value

data with the quality control measures listed in Details applied, converted to a numeric matrix if to_numeric_matrix = TRUE.

See Also

Other high level functionality: assess_quality(), review_quality_ctrl(), semantic_enrichment()

Examples

data(example_data)
require(tibble)

# create an example class_tbl object
# note that diabetes_type is classed as ordinal and is not modified because
# its levels are not pre-coded
tibble::tribble(~"var", ~"datatype",
"patient_id", "id",
"tumoursize", "numeric",
"t_stage", "ordinal_tstage",
"n_stage", "ordinal_nstage",
"diabetes", "factor",
"diabetes_type", "ordinal",
"hypertension", "factor",
"rural_urban", "factor",
"marital_status", "factor",
"SNP_a", "genotype",
"SNP_b", "genotype",
"free_text", "freetext") -> data_types

data_QC <- apply_quality_ctrl(example_data, patient_id, data_types,
   bin_cats = c("No" = "Yes", "rural" = "urban"), min_freq = 0.6)

[Package eHDPrep version 1.3.3 Index]