medic {tame} | R Documentation |
Medication clustering (based on ATC and timing)
Description
The medic method uses agglomerative hierarchical clustering with a
bespoke distance measure based on medication ATC code similarities,
medication timing, and medication amount or dosage.
Usage
medic(
  data,
  k = 5,
  id,
  atc,
  timing,
  base_clustering,
  linkage = "complete",
  summation_method = "sum_of_minima",
  alpha = 1,
  beta = 1,
  gamma = 1,
  p = 1,
  theta = (5:0)/5,
  parallel = FALSE,
  return_distance_matrix = FALSE,
  set_seed = FALSE,
  ...
)
Arguments
data |
A data frame containing all the variables for the clustering. |
k |
a vector specifying the number of clusters to identify. |
id |
< |
atc |
< |
timing |
< |
base_clustering |
< |
linkage |
The agglomeration method to be used in the clustering. This should be (an unambiguous abbreviation of) one of "ward.D", "ward.D2", "single", "complete", "average" (= UPGMA), "mcquitty" (= WPGMA), "median" (= WPGMC) or "centroid" (= UPGMC). See stats::hclust for more information. For a discussion of linkage criterion choice see details below. |
summation_method |
The summation method used in the distance measure. This should be either "double_sum" or "sum_of_minima". See details below for more information. |
alpha |
A number giving the tuning of the normalization. See details below for more information. |
beta |
A number giving the power of the individual medication combinations. See details below for more information. |
gamma |
A number giving the weight of the timing terms. See details below for more information. |
p |
The power of the Minkowski distance used in the timing-specific distance. See details below for more information. |
theta |
A vector of length 6 specifying the tuning of the ATC measure. See details below for more information. |
parallel |
A logical or an integer. |
return_distance_matrix |
A logical. |
set_seed |
A logical or an integer. |
... |
Additional arguments not currently in use. |
Details
The medic method uses agglomerative hierarchical
clustering with a bespoke distance measure based on medication ATC codes and
timing similarities to assign medication pattern clusters to people.
Two versions of the distance measure are available:
The double sum:
d(p_i, p_j) = N_{\alpha}(M_i \times M_j) \sum_{m \in M_i} \sum_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta}
and the sum of minima:
d(p_i, p_j) = \frac{1}{2}\left(N_{\alpha}(M_i) \sum_{m \in M_i} \min_{n \in M_j} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta} + N_{\alpha}(M_j) \sum_{n \in M_j} \min_{m \in M_i} \left((1 + D_{\theta}(m,n)) (1 + \gamma T_p(t_{im},t_{jn})) - 1\right)^{\beta}\right)
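As a rough illustration of how the two summation methods differ, the sketch below evaluates both formulas for a single pair of people. It is not the package's implementation: the helper names pair_terms, double_sum and sum_of_minima are hypothetical, and the ATC and timing distances are made-up inputs rather than values computed from real codes.

```r
# Per-pair term ((1 + D)(1 + gamma * T) - 1)^beta for every medication pair.
# Rows index the medications of person i, columns those of person j.
pair_terms <- function(atc_dist, timing_dist, gamma = 1, beta = 1) {
  ((1 + atc_dist) * (1 + gamma * timing_dist) - 1)^beta
}

# Double sum: all pairs are summed and normalized by |M_i x M_j|^(-alpha).
double_sum <- function(terms, alpha = 1) {
  length(terms)^(-alpha) * sum(terms)
}

# Sum of minima: each medication is matched to its closest counterpart,
# once from each person's point of view, and the two parts are averaged.
sum_of_minima <- function(terms, alpha = 1) {
  row_part <- nrow(terms)^(-alpha) * sum(apply(terms, 1, min))
  col_part <- ncol(terms)^(-alpha) * sum(apply(terms, 2, min))
  (row_part + col_part) / 2
}

# Two people using the same two medications with identical timing.
# Assumption: the ATC distance between the two different medications is 1.
atc_dist <- matrix(c(0, 1, 1, 0), nrow = 2)
timing_dist <- matrix(0, nrow = 2, ncol = 2)
terms <- pair_terms(atc_dist, timing_dist)

double_sum(terms)     # 0.5
sum_of_minima(terms)  # 0
```

In this toy case the sum of minima gives two people with identical medication a distance of 0, while the double sum does not, because it also counts the cross terms between different medications.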
Normalization
N_{\alpha}(x) = |x|^{-\alpha}
If the normalization tuning, alpha, is 0, then no normalization is
performed and the distance measure becomes highly dependent on the number of
distinct medications given. That is, people using more medication will have
larger distances to others. If the normalization tuning, alpha, is 1
(the default), then the summation is normalized by the number of terms in
the sum; in other words, the average is calculated.
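The effect of alpha can be checked with a few lines of R. This is only a sketch of the normalization formula; the function name normalization and the term values are hypothetical.

```r
# N_alpha(x) = |x|^(-alpha), where n_terms plays the role of |x|
normalization <- function(n_terms, alpha = 1) {
  n_terms^(-alpha)
}

# Made-up values standing in for the terms of one summation
terms <- c(1, 2, 3, 6)

normalization(length(terms), alpha = 0) * sum(terms)  # 12, the raw sum
normalization(length(terms), alpha = 1) * sum(terms)  # 3, the average
```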
ATC distance
The central idea of this method, namely the ATC distance, is given as
D_{\theta}(x, y) = \sum_{i=1,...,5} 1\{x \text{ and } y \text{ match on level } i \text{, but not level } i + 1\} \theta_i
The ATC distance is tuned using the vector theta.
Note that two ATC codes are said to match at level i when they are identical at level i. E.g. the two codes N06AB01 and N06AA01 match on level 1, 2, and 3 as they are both "N" at level 1, "N06" at level 2, and "N06A" at level 3, but at level 4 they differ ("N06AB" and "N06AA" are not the same).
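The level matching described above can be sketched in R as follows. The helper atc_match_level is hypothetical and not part of the package; it assumes full ATC codes where the five levels are given by the first 1, 3, 4, 5 and 7 characters, as in the example codes above.

```r
# Number of leading characters that define each of the five ATC levels
atc_level_widths <- c(1, 3, 4, 5, 7)

# Hypothetical helper: the deepest level at which two ATC codes still match
atc_match_level <- function(x, y) {
  prefixes_x <- substring(x, 1, atc_level_widths)
  prefixes_y <- substring(y, 1, atc_level_widths)
  # cumprod() drops to 0 from the first level where the codes differ
  sum(cumprod(prefixes_x == prefixes_y))
}

atc_match_level("N06AB01", "N06AA01")  # 3
atc_match_level("N06AB01", "N06AB01")  # 5
atc_match_level("N06AB01", "A10BA02")  # 0
```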
Timing distance
The timing distance is a simple Minkowski distance:
T_p(x,y) = \left(\sum_{t \in T} |x_t - y_t|^p\right)^{1/p}
When p is 1, the default, the Manhattan distance is used.
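A minimal sketch of this distance in R, with a hypothetical function name and made-up timing values for three periods:

```r
# Minkowski distance between two timing vectors of equal length
minkowski <- function(x, y, p = 1) {
  sum(abs(x - y)^p)^(1 / p)
}

# Made-up medication amounts in three time periods for two people
x <- c(1, 0, 2)
y <- c(0, 0, 4)

minkowski(x, y, p = 1)  # 3, the Manhattan distance
minkowski(x, y, p = 2)  # sqrt(5), the Euclidean distance
```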
Value
An object of class medic which describes the clusters produced by the hierarchical clustering process. The object is a list with components:
- data
the inputted data frame data with the cluster assignments appended at the end.
- clustering
a data frame with the person id as given by id, the .analysis_order and the clusters found.
- variables
a list of the variables used in the clustering.
- parameters
a data frame with all the inputted clustering parameters and the corresponding method names. These method names correspond to the column names for each cluster in the clustering data frame described right above.
- key
a list of keys used internally in the function to keep track of simplified versions of the data.
- distance_matrix
the distance matrices for each method if return_distance_matrix is TRUE, otherwise NULL.
- call
the matched call.
See Also
summary.medic for summaries and plots.
employ for employing an existing clustering to new data.
enrich for enriching the meta data in the medic object with additional data.
bind for binding together two comparable lists of clusterings.
Examples
# A simple clustering based only on ATC
clust <- medic(complications, id = id, atc = atc, k = 3)
# A simple clustering with both ATC and timing
clust <- medic(
  complications,
  id = id,
  atc = atc,
  timing = first_trimester:third_trimester,
  k = 3
)