UNCOVER {UNCOVER} | R Documentation |
Utilising Normalisation Constant Optimisation Via Edge Removal
Description
Generates cohorts for a data set through removal of edges from a graphical representation of the co-variates. Edges are removed (or reintroduced) by considering the normalisation constant (or Bayesian evidence) of a multiplicative Bayesian logistic regression model.
The first stage of the function is concerned purely with a greedy optimisation of the Bayesian evidence through edge manipulation. The second stage then addresses any other criteria (known as deforestation conditions) expressed by the user through reintroduction of edges.
Usage
UNCOVER(
X,
y,
mst_var = NULL,
options = UNCOVER.opts(),
stop_criterion = 5,
deforest_criterion = "None",
prior_mean = rep(0, ncol(X) + 1),
prior_var = diag(ncol(X) + 1),
verbose = TRUE
)
Arguments
X
Co-variate matrix.
y
Binary response vector.
mst_var
A vector specifying which variables of the co-variate matrix will be used to form the graph. If not specified, all variables will be used.
options
Additional arguments that can be specified through UNCOVER.opts(). Defaults to UNCOVER.opts().
stop_criterion
The maximum number of clusters allowed before the first stage terminates and deforestation begins. Defaults to 5.
deforest_criterion
Constraint type which the final model must satisfy. Can be one of "NoC", "SoC", "MaxReg", "Validation", "Diverse" or "None". Defaults to "None".
prior_mean
Mean for the multivariate normal prior used in the SMC sampler. See details. Defaults to the origin.
prior_var
Variance matrix for the multivariate normal prior used in the SMC sampler. See details. Defaults to the identity matrix.
verbose
Do you want the progress of the algorithm to be shown? Defaults to TRUE.
Details
Assumes a Bayesian logistic regression model for each cohort, with the overall model being a product of these sub-models.
A minimum spanning tree graph is first constructed from a subset of the co-variates. Then at each iteration, each edge in the current graph is checked to see whether removing it to split a cohort is beneficial; we then either select the optimal edge to remove or conclude that it is not beneficial to remove any more edges. At the end of each iteration we also check the set of removed edges to see whether it is beneficial to reintroduce any of them. After this process has ended, we reintroduce edges from the removed set specifically to meet the criteria set by the user, in the most optimal manner possible through a greedy approach. For more details see Emerson and Aslett (2023).
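To make the graph-construction step concrete, the following base-R sketch builds a minimum spanning tree over observations via Prim's algorithm on Euclidean distances between co-variate rows. This is purely illustrative of the idea and is not the package's internal implementation:

```r
# Minimal sketch (NOT the package's code): Prim's algorithm for a
# minimum spanning tree over observations, using Euclidean distance.
prim_mst <- function(X) {
  n <- nrow(X)
  D <- as.matrix(dist(X))            # pairwise Euclidean distances
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(0L, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    # among edges crossing the cut (tree -> non-tree), pick the shortest
    Dc <- D[in_tree, !in_tree, drop = FALSE]
    idx <- which(Dc == min(Dc), arr.ind = TRUE)[1, ]
    from <- which(in_tree)[idx[1]]
    to <- which(!in_tree)[idx[2]]
    edges[k, ] <- c(from, to)
    in_tree[to] <- TRUE
  }
  edges                              # (n - 1) x 2 matrix of tree edges
}

set.seed(1)
CM <- matrix(rnorm(20), 10, 2)
tree <- prim_mst(CM)
nrow(tree)   # a spanning tree on n nodes always has n - 1 edges
```

Each of the n - 1 rows of the result is one tree edge, so every observation is connected into a single initial cohort before any edges are removed.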
The graph can undergo deforestation to meet one of 6 possible criteria:

- "NoC": Number of Clusters - we specify a maximum number of clusters (options$max_K) we can tolerate in the final output of the algorithm.
- "SoC": Size of Clusters - we specify a minimum number of observations (options$min_size) we can tolerate being assigned to a cluster in the final output of the algorithm.
- "MaxReg": Maximal Regret - we give a maximum tolerance (exp(options$reg)) by which we allow the Bayesian evidence to decrease when reintroducing an edge.
- "Validation": Validation Data - we split the data (using options$train_frac) into training and validation data, apply the first stage of the algorithm to the training data, and then introduce the validation data for the deforestation stage. Edges are reintroduced if they lead to improved prediction of the validation data using the training data model (i.e. we aim to maximise a robustness statistic).
- "Diverse": Diverse Response Class Within Clusters - we specify a minimum number of observations (options$n_min_class) in a cluster that have the minority response class associated with them (the minority response class is determined for each cluster).
- "None": No Criteria Specified - we do not go through the second deforestation stage of the algorithm.
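The reason edge manipulation generates cohorts is that removing one edge from a tree always splits a connected component into exactly two. A small self-contained sketch (illustrative only, not the package's code) makes this visible:

```r
# Removing one edge from a spanning tree splits a cohort into two.
# Toy tree on 5 observations: edges 1-2, 2-3, 3-4, 4-5 (a path).
edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(4, 5))

# Label connected components by repeated label propagation over edges.
components <- function(edges, n) {
  comp <- seq_len(n)
  repeat {
    changed <- FALSE
    for (k in seq_len(nrow(edges))) {
      i <- edges[k, 1]; j <- edges[k, 2]
      m <- min(comp[i], comp[j])
      if (comp[i] != m || comp[j] != m) {
        comp[comp == comp[i] | comp == comp[j]] <- m
        changed <- TRUE
      }
    }
    if (!changed) break
  }
  match(comp, unique(comp))          # relabel components as 1, 2, ...
}

components(edges, 5)                       # all five in one cohort
components(edges[-2, , drop = FALSE], 5)   # drop edge 2-3: two cohorts
```

In UNCOVER the decision of *which* edge to drop is driven by the change in Bayesian evidence; this sketch only shows the combinatorial effect of a removal.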
All deforestation criteria other than "None" require additional arguments to be specified in options. See the examples and UNCOVER.opts() for more information. It is never recommended to use anything other than UNCOVER.opts to provide the options argument.
The prior used for the UNCOVER procedure takes the form of a multivariate normal, whose parameters can be specified directly by the user. It is, however, possible to override this default prior distributional form by specifying prior.override = TRUE and providing the relevant prior functions in UNCOVER.opts.
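Judging from the uniform-prior example below, a user-supplied pair appears to consist of an rprior(n, ...) that returns one sampled coefficient vector per row and a dprior(x, ...) that returns the density at each row of x. As a hedged sketch of a conforming pair (an independent Laplace prior here is my own illustration; consult UNCOVER.opts() for the exact contract):

```r
# Sketch of a user-supplied prior pair for prior.override = TRUE.
# Assumed contract (inferred from the examples, not authoritative):
# rprior(n, ...) -> n x p matrix of draws; dprior(x, ...) -> density per row.
rmvlaplace <- function(n, b) {
  p <- length(b)
  u <- matrix(runif(n * p) - 0.5, n, p)
  # inverse-CDF sampling for independent Laplace(0, b[j]) margins
  sweep(-sign(u) * log(1 - 2 * abs(u)), 2, b, `*`)
}
dmvlaplace <- function(x, b) {
  # density of each margin, then the product across co-ordinates
  dens <- exp(-abs(sweep(x, 2, b, `/`))) / (2 * rep(b, each = nrow(x)))
  apply(dens, 1, prod)
}

set.seed(1)
draws <- rmvlaplace(1000, b = rep(1, 3))
dim(draws)                                   # 1000 draws of 3 coefficients
dmvlaplace(matrix(0, 1, 3), b = rep(1, 3))   # density at the origin: 0.125
```

The pair would then be passed via UNCOVER.opts(prior.override = TRUE, rprior = rmvlaplace, dprior = dmvlaplace, b = rep(1, 3)), mirroring the uniform-prior example below.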
The diagnostic data frames will track various outputs of the UNCOVER procedure, depending on the deforestation criterion. All data frames contain an action (removal or addition of an edge to the graph) and the total log Bayesian evidence of the model gained through deployment of that action (for "Validation" this will be two columns, one for the training data and one for all of the data). "NoC" will also track the number of clusters, "SoC" will track the minimum cluster size and the number of criterion-breaking clusters, "Validation" will track the robustness statistic, and "Diverse" will track the minimum minority class count across all clusters alongside the number of criterion-breaking clusters.
Value
An object of class "UNCOVER", which is a list consisting of:

Covariate_Matrix
The co-variate matrix provided.
Response_Vector
The binary response vector provided.
Minimum_Spanning_Tree_Variables
A vector of indices for the co-variates used to construct the minimum spanning tree.
Control
A list of the additional arguments specified by options.
Deforestation_Criterion
The deforestation criterion specified.
Prior_Mean
The mean of the multivariate normal prior. Meaningless if the prior is overridden in options.
Prior_Variance
The variance matrix of the multivariate normal prior. Meaningless if the prior is overridden in options.
Model
A list containing: the cluster allocation of the training data, the log Bayesian evidences of the sub-models, the final graph of the clustered data, the number of clusters, the edges which were removed from the graph, and a diagnostics data frame (the contents of which vary depending on the deforestation criterion).

If deforest_criterion == "Validation" then Model is instead a list of two lists: one containing the model information for the training data (Training_Data) and the other containing the model information for all of the data (All_Data). Diagnostic information is only included in the All_Data list.
References
Emerson, S.R. and Aslett, L.J.M. (2023). Joint cohort and prediction modelling through graphical structure analysis (to be released)
See Also
UNCOVER.opts()
, print.UNCOVER()
, predict.UNCOVER()
, plot.UNCOVER()
Examples
# First we generate a co-variate matrix and binary response vector
CM <- matrix(rnorm(200),100,2)
rv <- sample(0:1,100,replace=TRUE)
# We can then run our algorithm to see what cohorts are selected for each
# of the different deforestation criteria
UN.none <- UNCOVER(X = CM,y = rv, deforest_criterion = "None",
verbose = FALSE)
UN.noc <- UNCOVER(X = CM,y = rv, deforest_criterion = "NoC",
options = UNCOVER.opts(max_K = 3), verbose = FALSE)
UN.soc <- UNCOVER(X = CM,y = rv, deforest_criterion = "SoC",
options = UNCOVER.opts(min_size = 10), verbose = FALSE)
UN.maxreg <- UNCOVER(X = CM,y = rv, deforest_criterion = "MaxReg",
options = UNCOVER.opts(reg = 1), verbose = FALSE)
UN.validation <- UNCOVER(X = CM,y = rv, deforest_criterion = "Validation",
options = UNCOVER.opts(train_frac = 0.8),
verbose = FALSE)
UN.diverse <- UNCOVER(X = CM,y = rv, deforest_criterion = "Diverse",
options = UNCOVER.opts(n_min_class = 2),
verbose = FALSE)
clu_al_mat <- rbind(UN.none$Model$Cluster_Allocation,
UN.noc$Model$Cluster_Allocation,
UN.soc$Model$Cluster_Allocation,
UN.maxreg$Model$Cluster_Allocation,
UN.validation$Model$All_Data$Cluster_Allocation,
UN.diverse$Model$Cluster_Allocation)
# We can create a matrix where each entry shows in how many of the methods
# did the indexed observations belong to the same cluster
obs_con_mat <- matrix(0,100,100)
for(i in 1:100){
for(j in 1:100){
obs_con_mat[i,j] <- length(which(clu_al_mat[,i]-clu_al_mat[,j]==0))/6
obs_con_mat[j,i] <- obs_con_mat[i,j]
}
}
head(obs_con_mat)
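The double loop computing the consensus matrix can equivalently be written with outer(). A self-contained sketch on a toy cluster-allocation matrix (rows = methods, columns = observations; the toy data is illustrative, not output from UNCOVER):

```r
# Vectorised consensus computation: entry (i, j) is the fraction of
# methods that place observations i and j in the same cluster.
toy_alloc <- rbind(c(1, 1, 2, 2),
                   c(1, 2, 2, 2),
                   c(1, 1, 1, 2))
consensus <- function(A) {
  n <- ncol(A)
  outer(seq_len(n), seq_len(n),
        Vectorize(function(i, j) mean(A[, i] == A[, j])))
}
consensus(toy_alloc)   # symmetric, with ones on the diagonal
```

This produces the same matrix as the explicit double loop above and avoids mutating a pre-allocated matrix entry by entry.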
# We can also view the outputted overall Bayesian evidence of the six
# models as well
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
sum(UN.noc$Model$Log_Marginal_Likelihoods),
sum(UN.soc$Model$Log_Marginal_Likelihoods),
sum(UN.maxreg$Model$Log_Marginal_Likelihoods),
sum(UN.validation$Model$All_Data$Log_Marginal_Likelihoods),
sum(UN.diverse$Model$Log_Marginal_Likelihoods))
# If we don't assume the prior for the regression coefficients is a
# standard multivariate normal but instead a multivariate normal with
# different parameters
UN.none.2 <- UNCOVER(X = CM,y = rv, deforest_criterion = "None",
prior_mean = rep(1,3), prior_var = 0.5*diag(3),
verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
sum(UN.none.2$Model$Log_Marginal_Likelihoods))
# We can also specify a completely different prior, for example a
# multivariate independent uniform
rmviu <- function(n,a,b){
  # draw n samples, with co-ordinate i uniform on [a[i], b[i]]
  return(mapply(FUN = function(min.vec,max.vec,pn){
    stats::runif(pn,min.vec,max.vec)},min.vec=a,max.vec=b,
    MoreArgs = list(pn = n)))
}
dmviu <- function(x,a,b){
for(ii in 1:ncol(x)){
x[,ii] <- dunif(x[,ii],a[ii],b[ii])
}
return(apply(x,1,prod))
}
UN.none.3 <- UNCOVER(X = CM,y = rv,deforest_criterion = "None",
options = UNCOVER.opts(prior.override = TRUE,
rprior = rmviu,
dprior = dmviu,a=rep(0,3),
b=rep(1,3)),verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
sum(UN.none.2$Model$Log_Marginal_Likelihoods),
sum(UN.none.3$Model$Log_Marginal_Likelihoods))
# We may only wish to construct our minimum spanning tree based on the first
# variable
UN.none.4 <- UNCOVER(X = CM,y = rv,mst_var = 1,deforest_criterion = "None",
verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
sum(UN.none.4$Model$Log_Marginal_Likelihoods))
# Increasing the stop criterion may uncover more clustering structure within
# the data, but comes with a time cost
system.time(UNCOVER(X = CM,y = rv,stop_criterion = 4,verbose = FALSE))
system.time(UNCOVER(X = CM,y = rv,stop_criterion = 6,verbose = FALSE))