define_case {healthdb} | R Documentation |
Identify diseases/events from administrative records
Description
This function is a composite of identify_row(), exclude(), restrict_n(), and restrict_date(). It aims to implement common case definitions in epidemiological studies using administrative databases as a one-shot big query. The intended use case is definitions of the form, e.g., two or more physician visits with some diagnostic code at least 30 days apart within two years. If all arguments are supplied, the component functions above are chained in the following order: identify_row(vals) %>% exclude(identify_row(excl_vals), by = clnt_id) %>% restrict_n() %>% restrict_date(). If some arguments are missing, only the necessary steps in the chain are run; see the verbose output for what was done. Note that if date_var is supplied, n_per_clnt is counted by distinct dates instead of by number of records.
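The difference between counting records and counting distinct dates, and the "at least 30 days apart within two years" condition, can be illustrated with a small base R sketch. This is not how healthdb implements the chain; the data frame `visits` and the helper `has_pair()` are hypothetical names introduced only for illustration, and the helper covers only the two-visit case.

```r
# Toy records: client 1 has three records but only two distinct dates.
visits <- data.frame(
  clnt_id = c(1, 1, 1, 2, 2),
  service_dt = as.Date(c(
    "2020-01-01", "2020-01-01", "2020-03-01",
    "2020-01-01", "2020-01-10"
  ))
)

# Split the dates by client
by_client <- split(visits$service_dt, visits$clnt_id)

# Count per client by number of records vs. by distinct dates
n_records <- lengths(by_client)
n_dates <- vapply(by_client, function(d) length(unique(d)), integer(1))

# Hypothetical helper (assumption, not a healthdb function):
# is there any pair of distinct dates whose gap is at least `apart` days
# and at most `within` days?
has_pair <- function(dates, apart = 30, within = 730) {
  d <- as.numeric(unique(dates))
  gaps <- abs(outer(d, d, "-")) # all pairwise gaps in days
  any(gaps >= apart & gaps <= within)
}

is_case <- vapply(by_client, has_pair, logical(1))
```

Here client 1 has three records but two distinct dates that are 60 days apart, so it meets the two-visit definition; client 2 has two dates only 9 days apart and does not.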
Usage
define_case(
data,
vars,
match = "in",
vals,
clnt_id,
n_per_clnt = 1,
date_var = NULL,
apart = NULL,
within = NULL,
uid = NULL,
excl_vals = NULL,
excl_args = NULL,
keep = c("all", "first", "last"),
if_all = FALSE,
mode = c("flag", "filter"),
force_collect = FALSE,
verbose = getOption("healthdb.verbose"),
...
)
Arguments
data |
Data.frames or remote tables (e.g., from |
vars |
An expression passed to |
match |
One of "in", "start", "regex", "like", "between", and "glue_sql". It determines how values would be matched. See |
vals |
Depending on |
clnt_id |
Grouping variable (quoted/unquoted). |
n_per_clnt |
A single number specifying the minimum group size. |
date_var |
Variable name (quoted/unquoted) for the dates to be interpreted. |
apart |
An integer specifying the minimum gap (in days) between adjacent dates in a draw. |
within |
An integer specifying the maximum time span (in days) of a draw. |
uid |
Variable name for a unique row identifier. It is necessary for SQL to produce consistent result based on sorting. |
excl_vals |
Same as |
excl_args |
A named list of arguments passed to the second |
keep |
One of:
|
if_all |
A logical for whether to combine the predicates (if multiple columns were selected by vars) with AND instead of OR. Default is FALSE, e.g., var1 in vals OR var2 in vals. |
mode |
Either:
|
force_collect |
A logical for whether to force downloading the result table if it is not a local data.frame. Downloading data could be slow, so the user has to opt in; default is FALSE. |
verbose |
A logical for whether to print an explanation of the operation. By default, the value is fetched from options. Use |
... |
Additional arguments, e.g., |
Value
A subset of the input data satisfying the specified case definition.
Examples
library(healthdb)

sample_size <- 30
df <- data.frame(
  clnt_id = rep(1:3, each = 10),
  service_dt = sample(seq(as.Date("2020-01-01"), as.Date("2020-01-31"), by = 1),
    size = sample_size, replace = TRUE
  ),
  diagx = sample(letters, size = sample_size, replace = TRUE),
  diagx_1 = sample(c(NA, letters), size = sample_size, replace = TRUE),
  diagx_2 = sample(c(NA, letters), size = sample_size, replace = TRUE)
)

# define from one source
define_case(df,
  vars = starts_with("diagx"), "in", vals = letters[1:4],
  clnt_id = clnt_id, date_var = service_dt,
  excl_args = list(if_all = TRUE),
  # remove non-case
  mode = "filter",
  # keeping the first record
  keep = "first"
)

# multiple sources with purrr::pmap
# arguments with length = 1 will be recycled to match the number of sources
# wrap expressions/unquoted variables with bquote(),
# or rlang::exprs() to prevent immediate evaluation,
# or just use quoted variable names
purrr::pmap(
  list(
    data = list(df, df),
    vars = rlang::exprs(starts_with("diagx")),
    match = c("in", "start"),
    vals = list(letters[1:4], letters[5:10]),
    clnt_id = list(bquote(clnt_id)), n_per_clnt = c(2, 3),
    date_var = "service_dt",
    excl_vals = list(letters[11:13], letters[14:16]),
    excl_args = list(list(if_all = TRUE), list(if_all = FALSE))
  ),
  define_case
)
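# a date-restricted definition
The case definition mentioned in the Description (two or more visits with some diagnostic code at least 30 days apart within two years) might be expressed roughly as below. This is a sketch under assumptions: the toy data frame `visits` is made up for illustration, two years is approximated as 730 days, and the call is skipped when healthdb is not installed. Argument names are taken from the Usage section.

```r
# Toy data (hypothetical): client 1 has visits 60 days apart,
# client 2 has visits only 9 days apart.
visits <- data.frame(
  clnt_id = c(1, 1, 2, 2),
  service_dt = as.Date(c("2020-01-01", "2020-03-01", "2020-01-01", "2020-01-10")),
  diagx = c("a", "b", "a", "c")
)

# Run only if healthdb is available
if (requireNamespace("healthdb", quietly = TRUE)) {
  library(healthdb)
  cases <- define_case(visits,
    vars = starts_with("diagx"), match = "in", vals = letters[1:4],
    clnt_id = clnt_id,
    n_per_clnt = 2, # counted by distinct dates because date_var is supplied
    date_var = service_dt,
    apart = 30, # minimum gap in days
    within = 730, # maximum span in days (assumed two years)
    mode = "filter"
  )
  print(cases)
}
```

Because apart and within are both supplied, restrict_date() is expected to be part of the chain; the verbose output reports which steps were actually run.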