get_prioritised_covariates {autoCovariateSelection}    R Documentation

Generate the prioritised covariates from the global list of binary recurrence covariates using multiplicative bias ranking


Description

The get_prioritised_covariates function ranks the binary recurrence covariates by their multiplicative bias and returns the top k of them as the prioritised covariates. This is the third and final step in the automated covariate selection process. The previous step, assessing recurrence and generating the binary recurrence covariates, is done using the get_recurrence_covariates function. See the 'Automated Covariate Selection' section below for more details regarding the overall process.


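Usage

get_prioritised_covariates(
  df,
  patientIdVarname,
  exposureVector,
  outcomeVector,
  patientIdVector,
  k = 500
)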



Arguments

df    The input data.frame. Ideally this should be the recurrence_data output of the get_recurrence_covariates function.

patientIdVarname    The name of the variable in df that contains the patient identifier.

exposureVector    The 1-D exposure (treatment/intervention) vector. It must be binary, with a value of 1 for patients in the primary cohort and 0 for those in the comparator cohort. Its length must equal that of patientIdVector and outcomeVector, and its elements must follow the same patient order as those two vectors.

outcomeVector    The 1-D outcome vector indicating whether the patient experienced the outcome of interest (value = 1) or not (value = 0). Its length must equal that of patientIdVector and exposureVector, and its elements must follow the same patient order as those two vectors.

patientIdVector    The 1-D vector of patient identifiers. It should contain every patient ID in the original two cohorts, with the same length and patient order as exposureVector and outcomeVector.

k    The maximum number of prioritised covariates the function should return. The default is 500, as described in the original paper.
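The length and coding requirements on the three vectors lend themselves to a quick check before the function is called. The sketch below uses a hypothetical helper, check_inputs, which is not part of autoCovariateSelection, and a made-up three-patient table. It verifies only the length and 0/1 conditions stated in the argument descriptions and cannot detect a mismatched patient order.

```r
# Hypothetical pre-flight check; NOT a function from autoCovariateSelection.
check_inputs <- function(exposureVector, outcomeVector, patientIdVector) {
  stopifnot(
    # all three vectors must describe the same number of patients
    length(exposureVector) == length(patientIdVector),
    length(outcomeVector) == length(patientIdVector),
    # exposure and outcome must be coded as 0/1
    all(exposureVector %in% c(0, 1)),
    all(outcomeVector %in% c(0, 1))
  )
  invisible(TRUE)
}

# Made-up baseline table with one row per patient (illustration only)
basetable <- data.frame(
  person_id    = c(101, 102, 103),
  treatment    = c(1, 0, 1),
  outcome_date = as.Date(c("2020-01-05", NA, NA))
)

# Deriving every vector from the same table keeps the patient order aligned
patientIds <- basetable$person_id
exposure   <- basetable$treatment
outcome    <- ifelse(is.na(basetable$outcome_date), 0, 1)

check_inputs(exposure, outcome, patientIds)
```

Deriving all three vectors from the same one-row-per-patient table, as here, is the simplest way to keep their order consistent.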


Details

Covariates across data dimensions (domains) should be prioritised by their potential for controlling confounding that is not conditional on exposure and other covariates. This means that the association of each covariate with the outcome (relative risk) should be taken into account when quantifying the 'potential' for confounding. The relative risk weighted by the ratio of the covariate's prevalence between the two exposure groups is known as multiplicative bias.

The alternative would be to use the absolute risk, which is the more straightforward way to quantify the potential for confounding. However, that method would invariably down-weight the association between the covariate and the outcome when the outcome prevalence is small and the exposure prevalence is high, a common situation in comparative effectiveness research that uses real-world data in retrospective cohort studies. The multiplicative bias term balances this and yields, for each covariate, a quantity that reflects its confounding potential.

The covariates are ranked by multiplicative bias and the top k are selected. By default k is 500, as described in the original paper. For further theoretical details of the algorithm, refer to the original article listed in the References section. get_prioritised_covariates is the function implementing what is described in the 'Prioritise Covariates' section of the article.
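The ranking idea can be illustrated with a small sketch. It assumes the Bross-type formula given in the cited article, Bias_M = (P_C1 (RR_CD - 1) + 1) / (P_C0 (RR_CD - 1) + 1), where P_C1 and P_C0 are the covariate prevalences in the exposed and comparator groups and RR_CD is the covariate-outcome relative risk, and it ranks covariates by |log(Bias_M)|. The helper multiplicative_bias and the toy data are hypothetical and are not the package's internal code; edge cases such as zero risk in a stratum are not handled.

```r
# Hypothetical helper for illustration; NOT the package's implementation.
multiplicative_bias <- function(covariate, exposure, outcome) {
  # prevalence of the covariate among exposed (1) and comparator (0) patients
  p_c1 <- mean(covariate[exposure == 1])
  p_c0 <- mean(covariate[exposure == 0])
  # relative risk of the outcome: covariate present vs covariate absent
  rr_cd <- mean(outcome[covariate == 1]) / mean(outcome[covariate == 0])
  (p_c1 * (rr_cd - 1) + 1) / (p_c0 * (rr_cd - 1) + 1)
}

# Made-up data for eight patients, all vectors in the same patient order
exposure <- c(1, 1, 1, 1, 0, 0, 0, 0)
outcome  <- c(1, 1, 0, 1, 1, 0, 0, 0)
covars <- data.frame(
  cov_a = c(1, 1, 1, 0, 1, 0, 0, 0),  # more common among the exposed
  cov_b = c(1, 0, 1, 0, 1, 0, 1, 0)   # equally common in both groups
)

# One multiplicative bias value per covariate column
bias <- sapply(covars, multiplicative_bias,
               exposure = exposure, outcome = outcome)

# Rank by absolute log bias and keep the top k covariates
k <- 1
ranked <- names(sort(abs(log(bias)), decreasing = TRUE))
top_k <- head(ranked, k)
print(bias)
print(top_k)
```

In this toy data cov_b has the same prevalence in both exposure groups, so its bias term is 1 and it ranks below cov_a, which is both unevenly distributed and associated with the outcome.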


Value

A named list containing two R objects

Automated Covariate Selection

The three steps in automated covariate selection are listed below with the functions implementing the methodology

  1. Identify candidate empirical covariates: get_candidate_covariates

  2. Assess recurrence: get_recurrence_covariates

  3. Prioritise covariates: get_prioritised_covariates


Author(s)

Dennis Robert


References

Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009;20(4):512-522. doi:10.1097/EDE.0b013e3181a663cc


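Examples

library(autoCovariateSelection)
library(dplyr)  # provides %>%, select() and distinct()

# rwd is assumed to be the sample data set available with the package
head(rwd, 3)

# One row per patient: identifier, treatment flag and outcome date
basetable <- rwd %>% select(person_id, treatment, outcome_date) %>% distinct()
head(basetable, 3)
patientIds <- basetable$person_id

# Step 1: identify candidate empirical covariates
step1 <- get_candidate_covariates(df = rwd, domainVarname = "domain",
  eventCodeVarname = "event_code", patientIdVarname = "person_id",
  patientIdVector = patientIds, n = 100, min_num_patients = 10)
out1 <- step1$covars_data
all.equal(patientIds, step1$patientIds) # should be TRUE

# Step 2: assess recurrence
step2 <- get_recurrence_covariates(df = out1,
  patientIdVarname = "person_id", eventCodeVarname = "event_code",
  patientIdVector = patientIds)
out2 <- step2$recurrence_data

# Step 3: prioritise covariates; outcome is 1 when an outcome date is recorded
out3 <- get_prioritised_covariates(df = out2,
  patientIdVarname = "person_id", exposureVector = basetable$treatment,
  outcomeVector = ifelse(is.na(basetable$outcome_date), 0, 1),
  patientIdVector = patientIds, k = 10)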

[Package autoCovariateSelection version 1.0.0 Index]