betaclust {betaclust} | R Documentation |
The betaclust wrapper function
Description
A family of model-based clustering techniques to identify methylation states in beta-valued DNA methylation data.
Usage
betaclust(
data,
M = 3,
N,
R,
model_names = "K..",
model_selection = "BIC",
parallel_process = FALSE,
seed = NULL
)
Arguments
data |
A dataframe of dimension |
M |
Number of methylation states to be identified in a DNA sample type. |
N |
Number of patients in the study. |
R |
Number of sample types collected from each patient for the study. |
model_names |
Models to run, chosen from K.., KN. and K.R (default = "K.."). See Details. |
model_selection |
Information criterion used for model selection. Options are AIC, BIC or ICL (default = BIC). |
parallel_process |
Setting this to TRUE runs the models in parallel for increased computational efficiency. The default is FALSE due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
Details
This is a wrapper function which can be used to fit all three models (K.., KN., K.R) within a single function.
The K.. and KN. models are used to analyse a single DNA sample type (R = 1) and cluster the C CpG sites into K clusters, which represent the different methylation states in that sample type. As each CpG site can belong to any of the M = 3 methylation states (hypomethylation, hemimethylation and hypermethylation), the default is K = M = 3.
The thresholds between methylation states are objectively inferred from the clustering solution.
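As a rough illustration of how a threshold can be read off a fitted mixture, the sketch below finds the point between two neighbouring clusters where their weighted beta densities are equal, so that the more probable cluster switches there. The shape parameters and proportions are made-up values, and the crossing-point rule and the helper name density_gap are assumptions for illustration; this is not necessarily how betaclust derives its thresholds.

```r
# Hypothetical fitted parameters for two neighbouring clusters
# (illustrative values, not output from betaclust).
alpha <- c(2, 20)     # first shape parameters
delta <- c(20, 2)     # second shape parameters
tau   <- c(0.5, 0.5)  # mixing proportions

# Difference between the weighted densities of cluster 1 and cluster 2.
density_gap <- function(x) {
  tau[1] * dbeta(x, alpha[1], delta[1]) - tau[2] * dbeta(x, alpha[2], delta[2])
}

# The gap is positive near 0 and negative near 1, so a root lies in between.
threshold <- uniroot(density_gap, lower = 0.1, upper = 0.9, tol = 1e-8)$root
print(threshold)
```

Because the two hypothetical clusters mirror each other, the densities cross at approximately 0.5; with asymmetric parameters or unequal proportions the crossing point shifts towards the smaller or more diffuse cluster.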
The K.R model is used to analyse R independent sample types collected from N patients, where each sample contains C CpG sites, and to cluster the dataset into K = M^R clusters to identify the differentially methylated CpG (DMC) sites between the R DNA sample types.
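One way to read K = M^R is as the number of ways to assign one of the M methylation states to each of the R sample types. The base R sketch below enumerates those combinations for M = 3 and R = 2. The state labels and the object names states, combos and differential are hypothetical, and treating each cluster as one such combination is an interpretation used for illustration rather than output from betaclust.

```r
M <- 3  # methylation states per sample type
R <- 2  # sample types per patient

# Hypothetical labels for the M = 3 states.
states <- c("hypo", "hemi", "hyper")

# Every combination of one state per sample type.
combos <- expand.grid(sample_type_1 = states, sample_type_2 = states,
                      stringsAsFactors = FALSE)

K <- M^R
stopifnot(nrow(combos) == K)

# Combinations whose state differs between the two sample types.
differential <- combos[combos$sample_type_1 != combos$sample_type_2, ]
print(K)
print(nrow(differential))
```

Under this reading, the combinations in which the state changes between sample types are the ones of interest when looking for DMC sites.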
Value
The function returns an object of the betaclust class which contains the following values:
information_criterion - The information criterion used to select the optimal model.
ic_output - The information criterion value calculated for each model.
optimal_model - The model selected as optimal.
function_call - The parameters passed as arguments to the function betaclust.
K - The number of clusters identified using the beta mixture models.
C - The number of CpG sites analysed using the beta mixture models.
N - The number of patients analysed using the beta mixture models.
R - The number of sample types analysed using the beta mixture models.
optimal_model_results - Information from the optimal model. Specifically,
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - This contains the first shape parameter for the beta mixture model.
delta - This contains the second shape parameter for the beta mixture model.
tau - The proportion of CpG sites in each cluster.
z - A matrix of dimension C × K containing the posterior probability of each CpG site belonging to each of the K clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
thresholds - Threshold points calculated under the K.. or the KN. model.
DM - The AUC and WD metrics for distribution similarity in each cluster.
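To make the relationship between z, classification, uncertainty and cluster_size concrete, here is a minimal base R sketch on a toy posterior matrix. The numbers are invented, and taking uncertainty as one minus the largest posterior probability is an assumed definition (the convention used in some other model-based clustering software), not something confirmed by this page.

```r
# Toy posterior probability matrix z: C = 4 CpG sites (rows), K = 3 clusters (columns).
z <- matrix(c(0.90, 0.08, 0.02,
              0.10, 0.70, 0.20,
              0.05, 0.15, 0.80,
              0.40, 0.35, 0.25),
            nrow = 4, byrow = TRUE)

# MAP classification: the cluster with the highest posterior probability for each site.
classification <- apply(z, 1, which.max)

# Uncertainty taken here as 1 minus the largest posterior probability (assumed definition).
uncertainty <- 1 - apply(z, 1, max)

# Number of CpG sites assigned to each of the K clusters.
cluster_size <- tabulate(classification, nbins = ncol(z))

print(classification)
print(round(uncertainty, 2))
print(cluster_size)
```

In this toy example the fourth site has nearly equal posterior probabilities across the clusters, so it receives the highest uncertainty even though it is still assigned to a single cluster.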
References
Silva, R., Moran, B., Russell, N.M., Fahey, C., Vlajnic, T., Manecksha, R.P., Finn, S.P., Brennan, D.J., Gallagher, W.M., Perry, A.S.: Evaluating liquid biopsies for methylomic profiling of prostate cancer. Epigenetics 15(6-7), 715-727 (2020). doi:10.1080/15592294.2020.1712876.
Majumdar, K., Silva, R., Perry, A.S., Watson, R.W., Murphy, T.B., Gormley, I.C.: betaclust: a family of mixture models for beta valued DNA methylation data. arXiv [stat.ME] (2022). doi:10.48550/ARXIV.2211.01938.
Examples
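library(betaclust)

my.seed <- 190
M <- 3  # number of methylation states
N <- 4  # number of patients
R <- 2  # number of sample types per patient
# Fit all three models on rows 1:30, columns 2:9 (N * R = 8 columns); select by BIC.
data_output <- betaclust(pca.methylation.data[1:30, 2:9], M, N, R,
                         model_names = c("K..", "KN.", "K.R"),
                         model_selection = "BIC",
                         parallel_process = FALSE, seed = my.seed)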