UNCOVER {UNCOVER}                                               R Documentation

Utilising Normalisation Constant Optimisation Via Edge Removal

Description

Generates cohorts for a data set through removal of edges from a graphical representation of the co-variates. Edges are removed (or reintroduced) by considering the normalisation constant (or Bayesian evidence) of a multiplicative Bayesian logistic regression model.

The first stage of the function is concerned purely with a greedy optimisation of the Bayesian evidence through edge manipulation. The second stage then addresses any other criteria (known as deforestation conditions) expressed by the user through reintroduction of edges.

Usage

UNCOVER(
  X,
  y,
  mst_var = NULL,
  options = UNCOVER.opts(),
  stop_criterion = 5,
  deforest_criterion = "None",
  prior_mean = rep(0, ncol(X) + 1),
  prior_var = diag(ncol(X) + 1),
  verbose = TRUE
)

Arguments

X

Co-variate matrix

y

Binary response vector

mst_var

A vector specifying which variables of the co-variate matrix will be used to form the graph. If not specified all variables will be used.

options

Additional arguments that can be specified for UNCOVER. See UNCOVER.opts() for details. If not specified, the defaults of UNCOVER.opts() are used.

stop_criterion

The maximum number of clusters allowed before the first stage terminates and deforestation begins. Defaults to 5.

deforest_criterion

Constraint type which the final model must satisfy. Can be one of "NoC", "SoC", "MaxReg", "Validation", "Diverse" or "None". See details. Defaults to "None".

prior_mean

Mean for the multivariate normal prior used in the SMC sampler. See details. Defaults to the origin.

prior_var

Variance matrix for the multivariate normal prior used in the SMC sampler. See details. Defaults to the identity matrix.

verbose

Logical; should the progress of the algorithm be shown? Defaults to TRUE.

Details

Assumes a Bayesian logistic regression model for each cohort, with the overall model being a product of these sub-models.

A minimum spanning tree graph is first constructed from a subset of the co-variates. At each iteration, every edge in the current graph is checked to see whether its removal (splitting a cohort) is beneficial, and then either the optimal edge is removed or we conclude that removing any further edges is not beneficial. At the end of each iteration the set of removed edges is also checked, to see whether it is beneficial to reintroduce any previously removed edges. After this process has ended, edges in the removed set are reintroduced specifically to meet the criteria set by the user, in the most optimal manner possible through a greedy approach. For more details see Emerson and Aslett (2023).
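The splitting step above can be sketched in base R. This is a hypothetical illustration, not the package's internals: removing an edge from a spanning tree splits one cohort into two connected components. The components() helper below is invented for this sketch.

```r
# Edges of a small spanning tree on 5 observations, one row per edge
tree_edges <- rbind(c(1, 2), c(2, 3), c(3, 4), c(3, 5))

# Connected components of a graph given its edge list, found by
# propagating the smallest reachable label until nothing changes
components <- function(n, edges) {
  comp <- seq_len(n)
  repeat {
    old <- comp
    for (k in seq_len(nrow(edges))) {
      i <- edges[k, 1]; j <- edges[k, 2]
      m <- min(comp[i], comp[j])
      comp[i] <- m; comp[j] <- m
    }
    if (identical(comp, old)) break
  }
  match(comp, unique(comp))   # relabel components as 1, 2, ...
}

components(5, tree_edges)         # one cohort:  1 1 1 1 1
components(5, tree_edges[-2, ])   # dropping edge (2,3) gives two cohorts: 1 1 2 2 2
```

Each candidate edge removal therefore proposes a split of one cohort into two, and UNCOVER scores such proposals by the change in Bayesian evidence.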

The graph can undergo deforestation to meet one of 6 possible criteria:

  1. "NoC": Number of Clusters - we specify a maximum number of clusters (options$max_K) we can tolerate in the final output of the algorithm.

  2. "SoC": Size of Clusters - we specify a minimum number of observations (options$min_size) we can tolerate being assigned to a cluster in the final output of the algorithm.

  3. "MaxReg": Maximal Regret - we give a maximum tolerance (exp(options$reg)) that we allow the Bayesian evidence to decrease by reintroducing an edge.

  4. "Validation": Validation Data - we split the data (using options$train_frac) into training and validation sets, apply the first stage of the algorithm to the training data and then introduce the validation data for the deforestation stage. Edges are reintroduced if they lead to improved prediction of the validation data using the training data model (i.e. we aim to maximise a robustness statistic).

  5. "Diverse": Diverse Response Class Within Clusters - we specify a minimum number of observations (options$n_min_class) in each cluster that have the minority response class associated to them (the minority response class is determined separately for each cluster).

  6. "None": No Criteria Specified - we do not go through the second deforestation stage of the algorithm.
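As an illustrative sketch (these helpers are hypothetical, not package functions), the "NoC", "SoC" and "Diverse" conditions can each be checked directly from a cluster allocation vector:

```r
clust <- c(1, 1, 1, 2, 2, 3, 3, 3, 3)   # hypothetical cluster allocation
resp  <- c(0, 1, 1, 0, 1, 1, 0, 1, 1)   # hypothetical binary responses

# "NoC": no more than max_K clusters in total
noc_ok <- function(clust, max_K) length(unique(clust)) <= max_K

# "SoC": every cluster contains at least min_size observations
soc_ok <- function(clust, min_size) all(table(clust) >= min_size)

# "Diverse": each cluster's minority response class contains at least
# n_min_class observations
diverse_ok <- function(clust, resp, n_min_class) {
  minority <- tapply(resp, clust, function(v) min(sum(v == 0), sum(v == 1)))
  all(minority >= n_min_class)
}

noc_ok(clust, max_K = 3)      # TRUE:  3 clusters
soc_ok(clust, min_size = 3)   # FALSE: cluster 2 has only 2 observations
diverse_ok(clust, resp, 1)    # TRUE:  every cluster has both response classes
```

During deforestation, edges are greedily reintroduced (merging clusters) until the chosen condition holds.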

All deforestation criteria other than "None" require additional arguments to be specified in options. See the examples and UNCOVER.opts() for more information. Providing the options argument by any means other than UNCOVER.opts() is not recommended.

The prior used for the UNCOVER procedure will take the form of a multivariate normal, where the parameters can be specified directly by the user. It is however possible to override this default prior distributional form by specifying prior.override=TRUE and providing the relevant prior functions in UNCOVER.opts.
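The default prior's form can be sketched explicitly in base R. This is a hypothetical illustration (rprior_default is not a package function): draws from a multivariate normal with the supplied prior_mean and prior_var, here for two co-variates (three regression coefficients including the intercept).

```r
prior_mean <- rep(0, 3)
prior_var  <- diag(3)

# Draw n samples from N(mean, var): scale standard normal draws by the
# Cholesky factor of var, then shift by mean
rprior_default <- function(n, mean, var) {
  z <- matrix(stats::rnorm(n * length(mean)), n, length(mean))
  sweep(z %*% chol(var), 2, mean, "+")
}

draws <- rprior_default(1000, prior_mean, prior_var)
colMeans(draws)   # approximately prior_mean
```

Overriding the prior via prior.override=TRUE replaces this sampler (and its density) with user-supplied rprior and dprior functions, as in the examples below.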

The diagnostic data frames will track various outputs of the UNCOVER procedure depending on the deforestation criterion. All data frames will contain an action (removal or addition of an edge to the graph) and the total log Bayesian evidence of the model gained through deployment of that action (for "Validation" this will be two columns, one for the training data and one for all of the data). "NoC" will also track the number of clusters, "SoC" will track the minimum cluster size and the number of criterion-breaking clusters, "Validation" will track the robustness statistic and "Diverse" will track the minimum minority class count across all clusters alongside the number of criterion-breaking clusters.

Value

An object of class "UNCOVER", which is a list consisting of:

Covariate_Matrix

The co-variate matrix provided.

Response_Vector

The binary response vector provided.

Minimum_Spanning_Tree_Variables

A vector of indices for the co-variates used to construct the minimum spanning tree.

Control

A list of the additional arguments specified by options.

Deforestation_Criterion

The deforestation criterion specified.

Prior_Mean

The mean of the multivariate normal prior. Meaningless if the prior is overridden in options.

Prior_Variance

The variance matrix of the multivariate normal prior. Meaningless if the prior is overridden in options.

Model

A list containing: the cluster allocation of the training data, the log Bayesian evidences of the sub-models, the final graph of the clustered data, the number of clusters, the edges which were removed from the graph and a diagnostics data frame (the contents of which vary depending on the deforestation criterion).

If deforest_criterion=="Validation" then Model is instead a list of two lists; one containing the model information for the training data (Training_Data) and the other containing model information for all of the data (All_Data). Diagnostic information is only included in the All_Data list.

References

See Also

UNCOVER.opts(), print.UNCOVER(), predict.UNCOVER(), plot.UNCOVER()

Examples



# First we generate a co-variate matrix and binary response vector
CM <- matrix(rnorm(200),100,2)
rv <- sample(0:1,100,replace=TRUE)

# We can then run our algorithm to see what cohorts are selected for each
# of the different deforestation criteria
UN.none <- UNCOVER(X = CM,y = rv, deforest_criterion = "None",
                   verbose = FALSE)
UN.noc <- UNCOVER(X = CM,y = rv, deforest_criterion = "NoC",
                  options = UNCOVER.opts(max_K = 3), verbose = FALSE)
UN.soc <- UNCOVER(X = CM,y = rv, deforest_criterion = "SoC",
                  options = UNCOVER.opts(min_size = 10), verbose = FALSE)
UN.maxreg <- UNCOVER(X = CM,y = rv, deforest_criterion = "MaxReg",
                     options = UNCOVER.opts(reg = 1), verbose = FALSE)
UN.validation <- UNCOVER(X = CM,y = rv, deforest_criterion = "Validation",
                         options = UNCOVER.opts(train_frac = 0.8),
                         verbose = FALSE)
UN.diverse <- UNCOVER(X = CM,y = rv, deforest_criterion = "Diverse",
                       options = UNCOVER.opts(n_min_class = 2),
                       verbose = FALSE)
clu_al_mat <- rbind(UN.none$Model$Cluster_Allocation,
                    UN.noc$Model$Cluster_Allocation,
                    UN.soc$Model$Cluster_Allocation,
                    UN.maxreg$Model$Cluster_Allocation,
                    UN.validation$Model$All_Data$Cluster_Allocation,
                    UN.diverse$Model$Cluster_Allocation)
# We can create a matrix where each entry shows in how many of the methods
# did the indexed observations belong to the same cluster
obs_con_mat <- matrix(0,100,100)
for(i in 1:100){
  for(j in 1:100){
    obs_con_mat[i,j] <- length(which(clu_al_mat[,i]-clu_al_mat[,j]==0))/6
    obs_con_mat[j,i] <- obs_con_mat[i,j]
  }
}
head(obs_con_mat)

# We can also view the outputted overall Bayesian evidence of the six
# models as well
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
  sum(UN.noc$Model$Log_Marginal_Likelihoods),
  sum(UN.soc$Model$Log_Marginal_Likelihoods),
  sum(UN.maxreg$Model$Log_Marginal_Likelihoods),
  sum(UN.validation$Model$All_Data$Log_Marginal_Likelihoods),
  sum(UN.diverse$Model$Log_Marginal_Likelihoods))

# If we don't assume the prior for the regression coefficients is a
# standard multivariate normal but instead a multivariate normal with
# different parameters
UN.none.2 <- UNCOVER(X = CM,y = rv, deforest_criterion = "None",
                     prior_mean = rep(1,3), prior_var = 0.5*diag(3),
                     verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
  sum(UN.none.2$Model$Log_Marginal_Likelihoods))
# We can also specify a completely different prior, for example a
# multivariate independent uniform
rmviu <- function(n,a,b){
  return(mapply(FUN = function(min.vec,max.vec,pn){
                        stats::runif(pn,min.vec,max.vec)},min.vec=a,max.vec=b,
                MoreArgs = list(pn = n)))
}
dmviu <- function(x,a,b){
  for(ii in 1:ncol(x)){
    x[,ii] <- dunif(x[,ii],a[ii],b[ii])
  }
  return(apply(x,1,prod))
}
UN.none.3 <- UNCOVER(X = CM,y = rv,deforest_criterion = "None",
                     options = UNCOVER.opts(prior.override = TRUE,
                                            rprior = rmviu,
                                            dprior = dmviu,a=rep(0,3),
                                            b=rep(1,3)),verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
  sum(UN.none.2$Model$Log_Marginal_Likelihoods),
  sum(UN.none.3$Model$Log_Marginal_Likelihoods))

# We may only wish to construct our minimum spanning tree based on the first
# variable
UN.none.4 <- UNCOVER(X = CM,y = rv,mst_var = 1,deforest_criterion = "None",
                     verbose = FALSE)
c(sum(UN.none$Model$Log_Marginal_Likelihoods),
  sum(UN.none.4$Model$Log_Marginal_Likelihoods))

# Increasing the stop criterion may uncover more clustering structure within
# the data, but comes with a time cost
system.time(UNCOVER(X = CM,y = rv,stop_criterion = 4,verbose = FALSE))
system.time(UNCOVER(X = CM,y = rv,stop_criterion = 6,verbose = FALSE))



[Package UNCOVER version 1.1.0 Index]