optimal_kmeans_d {riskclustr} | R Documentation |
Obtain optimal D solution based on k-means clustering of disease marker data in a case-control study
Description
optimal_kmeans_d applies k-means clustering using the kmeans function with
many random starts. The D value is then calculated for the cluster solution
at each random start using the d function, and the cluster solution that
maximizes D is returned, along with the corresponding value of D. In this way
the optimally etiologically heterogeneous subtype solution can be identified
from possibly high-dimensional disease marker data.
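The procedure described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the simulated data, the helper best_of_random_starts, and the function placeholder_score are all hypothetical names introduced here. In particular, placeholder_score is only a stand-in for D and does not compute it; in riskclustr the D value comes from the d function.

```r
# Illustrative sketch only; not the riskclustr implementation.
set.seed(1)

# Hypothetical simulated case-control data: two markers, one risk factor.
n <- 200
dat <- data.frame(
  case = rep(c(1, 0), each = n / 2),
  x1 = rnorm(n),
  y1 = rnorm(n),
  y2 = rnorm(n)
)

# Stand-in for D (NOT the real D): the variance of the cluster-specific
# means of a risk factor among cases.
placeholder_score <- function(labels, x) {
  var(as.vector(tapply(x, labels, mean)))
}

best_of_random_starts <- function(data, markers, M, case, score_x,
                                  nstart = 20) {
  is_case <- data[[case]] == 1
  y <- data[is_case, markers, drop = FALSE]  # cluster the cases only
  best_score <- -Inf
  best_label <- NULL
  for (i in seq_len(nstart)) {
    # One random start per iteration, so each solution can be scored.
    fit <- kmeans(y, centers = M, nstart = 1)
    score <- placeholder_score(fit$cluster, data[[score_x]][is_case])
    if (score > best_score) {
      best_score <- score
      best_label <- fit$cluster
    }
  }
  label <- integer(nrow(data))  # controls keep label 0
  label[is_case] <- best_label
  list(score = best_score, data = cbind(data, label = label))
}

out <- best_of_random_starts(dat, c("y1", "y2"), M = 3,
                             case = "case", score_x = "x1")
```

The sketch keeps the highest-scoring solution across starts rather than the one with the lowest within-cluster sum of squares, which is what passing nstart directly to kmeans would select.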
Usage
optimal_kmeans_d(markers, M, factors, case, data, nstart = 100, seed = NULL)
Arguments
markers |
a vector of the names of the disease markers. These markers
should be of a type that is suitable for use with kmeans. |
M |
the number of clusters to identify using kmeans. |
factors |
a list of the names of the binary or continuous risk factors.
For binary risk factors the lowest level will be used as the reference level.
e.g. |
case |
denotes the variable that contains each subject's status as a
case or control. This value should be 1 for cases and 0 for controls.
Argument must be supplied in quotes, e.g. |
data |
the name of the dataframe that contains the relevant variables. |
nstart |
the number of random starts to use with kmeans. |
seed |
an integer argument passed to |
Value
Returns a list with the following elements:
optimal_d
The D value for the optimal D solution
optimal_d_data
The original data frame supplied through the data argument, with a column
called optimal_d_label added for the optimal D subtype label. This column
holds the subtype assignment for cases and is 0 for all controls.
References
Begg, C. B., Zabor, E. C., Bernstein, J. L., Bernstein, L., Press, M. F., & Seshan, V. E. (2013). A conceptual and methodological framework for investigating etiologic heterogeneity. Stat Med, 32(29), 5039-5052.
Examples
# Cluster 30 disease markers to identify the optimally
# etiologically heterogeneous 3-subtype solution
res <- optimal_kmeans_d(
markers = paste0("y", 1:30),
M = 3,
factors = list("x1", "x2", "x3"),
case = "case",
data = subtype_data,
nstart = 100,
seed = 81110224
)
# Look at the value of D for the optimal D solution
res[["optimal_d"]]
# Look at a table of the optimal D solution
table(res[["optimal_d_data"]]$optimal_d_label)