betaclust {betaclust} | R Documentation |
A family of model-based clustering techniques to identify methylation states in beta-valued DNA methylation data.
betaclust(
data,
M = 3,
N,
R,
model_names = "K..",
model_selection = "BIC",
parallel_process = FALSE,
seed = NULL
)
data |
A dataframe of dimension |
M |
Number of methylation states to be identified in a DNA sample. |
N |
Number of patients in the study. |
R |
Number of samples collected from each patient for the study. |
model_names |
Models to run, chosen from K.., KN. and K.R (default = "K.."). See details. |
model_selection |
Information criterion used for model selection. Options are AIC, BIC or ICL (default = BIC). |
parallel_process |
If TRUE, the models are fitted in parallel for increased computational efficiency. The default is FALSE due to package testing limitations. |
seed |
Seed to allow for reproducibility (default = NULL). |
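For intuition on the model_selection argument, the sketch below computes AIC and BIC from a log-likelihood and a parameter count. It is illustrative only: the log-likelihoods, parameter counts and number of observations are made up, and the sign convention (smaller is better) is an assumption that may differ from the one the package uses internally.

```r
# Hypothetical fits: log-likelihoods and parameter counts are made up.
fits <- data.frame(
  model  = c("K..", "KN."),
  loglik = c(1250.4, 1310.9),
  n_par  = c(8, 26),
  stringsAsFactors = FALSE
)
n_obs <- 30  # assumed number of observations (e.g. CpG sites)

# Assumed convention: smaller values indicate a better model.
fits$AIC <- -2 * fits$loglik + 2 * fits$n_par
fits$BIC <- -2 * fits$loglik + log(n_obs) * fits$n_par

# Model preferred under BIC with this convention.
best_by_bic <- fits$model[which.min(fits$BIC)]
print(fits)
print(best_by_bic)
```

BIC penalises each extra parameter by log(n_obs) rather than 2, so it tends to favour simpler models than AIC once n_obs exceeds about 7.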
This is a wrapper function which can be used to fit all three models (K.., KN., K.R) in a single call.
The K.. and KN. models are used to analyse a single DNA sample (R = 1) and cluster the C CpG sites into K clusters, which represent the different methylation states in a DNA sample. As each CpG site can belong to any of the M = 3 methylation states (hypomethylation, hemimethylation and hypermethylation), the default is K = M = 3.
The thresholds between methylation states are objectively inferred from the clustering solution.
The K.R model is used to analyse R independent samples collected from N patients, where each sample contains C CpG sites, and to cluster the dataset into K = M^R clusters in order to identify the differentially methylated CpG (DMC) sites between the R DNA samples.
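The cluster count K = M^R arises because each of the R samples can be in any of the M states. A minimal base-R sketch, assuming M = 3 and R = 2; the state labels and variable names are illustrative and are not package output:

```r
# Illustrative only: enumerate the M^R state combinations behind the K.R model.
M <- 3  # methylation states per sample
R <- 2  # samples per patient

states <- c("hypo", "hemi", "hyper")  # hypothetical labels for the M = 3 states

# One column per sample, one row per combination of states across samples.
combos <- expand.grid(rep(list(states), R), stringsAsFactors = FALSE)
names(combos) <- paste0("sample_", seq_len(R))

K <- nrow(combos)  # equals M^R

# Combinations whose state differs between samples would correspond to
# differential methylation between the R samples.
differs <- apply(combos, 1, function(s) length(unique(s)) > 1)

print(combos)
print(K)
print(sum(differs))
```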
The function returns an object of the betaclust class which contains the following values:
information_criterion - The information criterion used to select the optimal model.
ic_output - The information criterion value calculated for each model.
optimal_model - The model selected as optimal.
function_call - The parameters passed as arguments to the function betaclust.
K - The number of clusters identified using the beta mixture models.
C - The number of CpG sites analysed using the beta mixture models.
N - The number of patients analysed using the beta mixture models.
R - The number of samples analysed using the beta mixture models.
optimal_model_results - Information from the optimal model. Specifically,
cluster_size - The total number of CpG sites in each of the K clusters.
llk - A vector containing the log-likelihood value at each step of the EM algorithm.
alpha - This contains the first shape parameter for the beta mixture model.
delta - This contains the second shape parameter for the beta mixture model.
tau - The proportion of CpG sites in each cluster.
z - A matrix of dimension C × K containing the posterior probability of each CpG site belonging to each of the K clusters.
classification - The classification corresponding to z, i.e. map(z).
uncertainty - The uncertainty of each CpG site's clustering.
thresholds - Threshold points calculated under the K.. or the KN. model.
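To illustrate how tau, alpha, delta, z, classification and uncertainty relate, here is a minimal base-R sketch for a toy mixture of K = 3 beta distributions with a single beta value per CpG site. The parameter values are made up rather than fitted, and taking uncertainty as one minus the largest posterior probability is an assumption (a common convention for mixture models), not something this page states.

```r
# Toy parameters (hypothetical, not fitted values).
alpha <- c(2, 10, 20)      # first shape parameter per cluster
delta <- c(20, 10, 2)      # second shape parameter per cluster
tau   <- c(0.4, 0.2, 0.4)  # mixing proportions, sum to 1

x <- c(0.05, 0.50, 0.95, 0.30)  # beta values for C = 4 CpG sites

# Weighted density of each site under each cluster: a C x K matrix.
dens <- sapply(seq_along(tau), function(k) tau[k] * dbeta(x, alpha[k], delta[k]))

# Posterior probabilities z: normalise each row to sum to 1.
z <- dens / rowSums(dens)

# Maximum a posteriori classification, i.e. map(z).
classification <- apply(z, 1, which.max)

# Assumed convention: uncertainty = 1 - highest posterior probability.
uncertainty <- 1 - apply(z, 1, max)

# Number of sites assigned to each cluster.
cluster_size <- tabulate(classification, nbins = length(tau))

print(round(z, 3))
print(classification)
print(round(uncertainty, 3))
```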
Silva, R., Moran, B., Russell, N.M., Fahey, C., Vlajnic, T., Manecksha, R.P., Finn, S.P., Brennan, D.J., Gallagher, W.M., Perry, A.S.: Evaluating liquid biopsies for methylomic profiling of prostate cancer. Epigenetics 15(6-7), 715-727 (2020). doi: 10.1080/15592294.2020.1712876.
Majumdar, K., Silva, R., Perry, A.S., Watson, R.W., Murphy, T.B., Gormley, I.C.: betaclust: a family of mixture models for beta valued DNA methylation data. arXiv [stat.ME] (2022). doi: 10.48550/ARXIV.2211.01938.
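library(betaclust)

my.seed <- 190  # fixed seed for reproducibility
M <- 3          # number of methylation states
N <- 4          # number of patients
R <- 2          # number of samples per patient

# Fit all three models on 30 CpG sites; columns 2:9 give N * R = 8 columns.
data_output <- betaclust(pca.methylation.data[1:30, 2:9], M, N, R,
                         model_names = c("K..", "KN.", "K.R"),
                         model_selection = "BIC",
                         parallel_process = FALSE, seed = my.seed)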