miic {miic}

R Documentation

MIIC, causal network learning algorithm including latent variables

Description

MIIC (Multivariate Information based Inductive Causation) combines constraint-based and information-theoretic approaches to disentangle direct from indirect effects amongst correlated variables, including cause-effect relationships and the effect of unobserved latent causes.

Usage

miic(
  input_data,
  state_order = NULL,
  true_edges = NULL,
  black_box = NULL,
  n_threads = 1,
  cplx = c("nml", "mdl"),
  orientation = TRUE,
  ori_proba_ratio = 1,
  propagation = TRUE,
  latent = c("no", "yes", "orientation"),
  n_eff = -1,
  n_shuffles = 0,
  conf_threshold = 0,
  sample_weights = NULL,
  test_mar = TRUE,
  consistent = c("no", "orientation", "skeleton"),
  max_iteration = 100,
  consensus_threshold = 0.8,
  verbose = FALSE
)

Arguments

`input_data`	[a data frame] A n*d data frame (n samples, d variables) that contains the observational data. Each column corresponds to one variable and each row is a sample that gives the values for all the observed variables. The column names correspond to the names of the observed variables. Numeric columns will be treated as continuous values, factors and character as categorical.
`state_order`	[a data frame] An optional d*(2-3) data frame giving the order of the ordinal categorical variables. It will be used during post-processing to compute the signs of the edges using partial linear correlation. If specified, the data frame must have at least a "var_names" column, containing the names of each variable as specified by colnames(input_data). A "var_type" column may specify if each variable is to be considered as discrete (0) or continuous (1). And the "levels_increasing_order" column contains a single character string with all of the unique levels of the ordinal variable in increasing order, delimited by a comma. If the variable is categorical but not ordinal, the "levels_increasing_order" column may instead contain NA.
`true_edges`	[a data frame] An optional E*2 data frame containing the E edges of the true graph for computing performance after the run.
`black_box`	[a data frame] An optional E*2 data frame containing E pairs of variables that will be considered as independent during the network reconstruction. In practice, these edges will not be included in the skeleton initialization and cannot be part of the final result. Variable names must correspond to the input_data data frame.
`n_threads`	[a positive integer] When set greater than 1, n_threads parallel threads will be used for computation. Make sure your compiler is compatible with OpenMP if you wish to use multithreading.
`cplx`	[a string; c("nml", "mdl")] In practice, the finite size of the input dataset requires that the 2-point and 3-point information measures be shifted by a complexity term. The finite size corrections can be based on the Minimum Description Length (MDL) criterion (set the option with "mdl"). However, the MDL complexity criterion tends to underestimate the relevance of edges connecting variables with many different categories, leading to the removal of true edges (false negatives). To avoid such biases with finite datasets, the (universal) Normalized Maximum Likelihood (NML) criterion can be used (set the option with "nml"). The default is "nml" (see Affeldt et al., UAI 2015).
`orientation`	[a boolean value] The miic network skeleton can be partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The propagation procedure relies on probabilities (for more details, see Verny et al., PLoS Comp. Bio. 2017). If set to FALSE, the orientation step is not performed.
`ori_proba_ratio`	[a floating point between 0 and 1] The threshold used to accept an orientation based on its probability. For a given edge, let p > 0.5 denote the probability of orientation; the orientation is accepted if (1 - p) / p < ori_proba_ratio. 0 means reject all orientations, 1 means accept all orientations.
`propagation`	[a boolean value] If set to FALSE, the skeleton is partially oriented with only the v-structure orientations. Otherwise, the v-structure orientations are propagated to downstream undirected edges in unshielded triples following the orientation method.
`latent`	[a string; c("no", "yes", "orientation")] When set to "yes", the network reconstruction takes into account hidden (latent) variables. When set to "orientation", latent variables are not considered during the skeleton reconstruction, but bi-directed edges are allowed during the orientation. Dependence between two observed variables due to a latent variable is indicated with a '6' in the adjacency matrix and in the network edges.summary and by a bi-directed edge in the (partially) oriented graph.
`n_eff`	[a positive integer] The n samples given in the input_data data frame are expected to be independent. In case of correlated samples such as in time series or Monte Carlo sampling approaches, the effective number of independent samples n_eff can be estimated using the decay of the autocorrelation function (Verny et al., PLoS Comp. Bio. 2017). This effective number n_eff of independent samples can be provided using this parameter.
`n_shuffles`	[a positive integer] The number of shufflings of the original dataset in order to evaluate the edge specific confidence ratio of all inferred edges.
`conf_threshold`	[a positive floating point] The threshold used to filter the less probable edges following the skeleton step. See Verny et al., PLoS Comp. Bio. 2017.
`sample_weights`	[a numeric vector] An optional vector containing the weight of each observation.
`test_mar`	[a boolean value] If set to TRUE, distributions with missing values will be tested with the Kullback-Leibler divergence: a conditioning variable `Z` for the given link `X \rightarrow Y` will be considered only if the divergence between the full distribution and the non-missing distribution `KL(P(X,Y) \| P(X,Y)_{!NA})` is low enough (with `P(X,Y)_{!NA}` as the joint distribution of `X` and `Y` on samples which are not missing on `Z`). This is a way to ensure that data are missing at random for the considered interaction and to avoid selection bias. Set to TRUE by default.
`consistent`	[a string; c("no", "orientation", "skeleton")] If "orientation", iterate over the skeleton and orientation steps to ensure consistency of the network. If "skeleton", iterate over the skeleton step to get a consistent skeleton, then orient edges and discard inconsistent orientations to ensure consistency of the network. See Li et al., NeurIPS 2019 for details.
`max_iteration`	[a positive integer] When the consistent parameter is set to "skeleton" or "orientation", the maximum number of iterations allowed when trying to find a consistent graph. Set to 100 by default.
`consensus_threshold`	[a floating point between 0.5 and 1.0] When the consistent parameter is set to "skeleton" or "orientation" and the result graph is inconsistent, or is a union of more than one inconsistent graph, a consensus graph is produced from a pool of graphs. If the result graph is inconsistent, the pool is made of the [max_iteration] graphs from the iterations; otherwise it is made of the graphs in the union. In the consensus graph, the status of each edge is the most frequent status in the pool, provided its frequency is equal to or higher than [consensus_threshold]; otherwise the edge is set as undirected. For example, if the pool contains [A, B, B, B, C], the most frequent status is B, with frequency 0.6. Set to 0.8 by default.
`verbose`	[a boolean value] If TRUE, debugging output is printed.
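
As an illustration of the `state_order` and `black_box` arguments described above, the following minimal sketch builds both data frames with base R. The toy data, the variable names and the `black_box` column names ("x" and "y") are assumptions made for this example only. The use of NA for the continuous variable is also an assumption, since the text above only states that NA is allowed for categorical, non-ordinal variables.

```r
# Toy observational data (hypothetical): one continuous, one ordinal and
# one nominal variable.
set.seed(1)
input_data <- data.frame(
  expression = rnorm(6),
  grade = c("low", "high", "medium", "low", "medium", "high"),
  tissue = c("liver", "lung", "liver", "brain", "lung", "brain"),
  stringsAsFactors = FALSE
)

# state_order: one row per variable, using the documented column names.
state_order <- data.frame(
  var_names = colnames(input_data),
  var_type = c(1, 0, 0),  # 1 = continuous, 0 = discrete
  # Ordinal levels as one comma-delimited string; NA when there is no order.
  levels_increasing_order = c(NA, "low,medium,high", NA),
  stringsAsFactors = FALSE
)

# black_box: one row per pair of variables to be treated as independent.
# The column names "x" and "y" are illustrative, not required by the text.
black_box <- data.frame(x = "expression", y = "tissue",
                        stringsAsFactors = FALSE)

# Every referenced variable name must exist in input_data.
stopifnot(all(state_order$var_names %in% colnames(input_data)))
stopifnot(all(unlist(black_box) %in% colnames(input_data)))

# The resulting objects would then be passed on, for example:
# res <- miic::miic(input_data, state_order = state_order, black_box = black_box)
```

The two `stopifnot()` checks catch misspelled variable names before a possibly long reconstruction is started.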

Details

Starting from a complete graph, the method iteratively removes dispensable edges, by uncovering significant information contributions from indirect paths, and assesses edge-specific confidences from randomization of available data. The remaining edges are then oriented based on the signature of causality in observational data.
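
For the orientation step, the acceptance rule stated for the `ori_proba_ratio` argument can be written out in a few lines. This is only a sketch of that inequality: the helper `accept_orientation()` is hypothetical and not part of the miic package, and the probabilities are made-up values.

```r
# Accept an orientation with probability p (p > 0.5) when
# (1 - p) / p < ori_proba_ratio, as stated for the ori_proba_ratio argument.
accept_orientation <- function(p, ori_proba_ratio = 1) {
  stopifnot(all(p > 0.5 & p <= 1))
  stopifnot(ori_proba_ratio >= 0, ori_proba_ratio <= 1)
  (1 - p) / p < ori_proba_ratio
}

p <- c(0.6, 0.9, 0.99)  # made-up orientation probabilities

accept_orientation(p, ori_proba_ratio = 1)    # TRUE  TRUE  TRUE
accept_orientation(p, ori_proba_ratio = 0.1)  # FALSE FALSE TRUE
accept_orientation(p, ori_proba_ratio = 0)    # FALSE FALSE FALSE
```

Lowering `ori_proba_ratio` therefore keeps only the orientations whose probability is close to 1.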

The method relies on an information-theoretic (conditional) independence test described in Verny et al., PLoS Comp. Bio. 2017 and Cabeli et al., PLoS Comp. Bio. 2020. It deals with both categorical and continuous variables by performing optimal context-dependent discretization. As such, the input data frame may contain both numerical columns, which will be treated as continuous, and character / factor columns, which will be treated as categorical. For further details on the optimal discretization method and the conditional independence test, see the function discretizeMutual. The user may also choose to run miic with the scheme presented in Li et al., NeurIPS 2019 to improve the interpretability of the end result by ensuring consistent separating sets during the skeleton iterations.
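
When the consistent scheme does not yield a single consistent graph, the `consensus_threshold` argument describes a voting rule over a pool of graphs. The sketch below applies that rule to the status of a single edge, using the pool from the argument description. The helper `consensus_status()` and the status labels are hypothetical, and breaking ties by taking the first most frequent status is an assumption.

```r
# Pick the most frequent status in the pool; keep it only if its frequency
# reaches the threshold, otherwise fall back to an undirected edge.
consensus_status <- function(pool, consensus_threshold = 0.8,
                             fallback = "undirected") {
  freq <- table(pool) / length(pool)
  best <- names(freq)[which.max(freq)]  # first maximum wins in case of ties
  if (freq[[best]] >= consensus_threshold) best else fallback
}

pool <- c("A", "B", "B", "B", "C")  # B has frequency 3/5 = 0.6

consensus_status(pool, consensus_threshold = 0.6)  # "B"
consensus_status(pool, consensus_threshold = 0.8)  # "undirected"
```

With the default threshold of 0.8, the example pool is not decisive enough and the edge is left undirected.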

Value

A miic-like object that contains:

all.edges.summary: a data frame with information about the relationship between each pair of variables
- x: X node
- y: Y node
- type: contains 'N' if the edge has been removed or 'P' for retained edges. If a true edges file is given, 'P' becomes 'TP' (True Positive) or 'FP' (False Positive), while 'N' becomes 'TN' (True Negative) or 'FN' (False Negative).
- ai: the contributing nodes found by the method which participate in the mutual information between x and y, and possibly separate them.
- info: provides the pairwise mutual information times Nxyi for the pair (x, y).
- info_cond: provides the conditional mutual information times Nxy_ai for the pair (x, y) when conditioned on the collected nodes ai. It is equal to the info column when ai is an empty set.
- cplx: gives the computed complexity between the (x, y) variables taking into account the contributing nodes ai. Edges that have more conditional information info_cond than cplx are retained in the final graph.
- Nxy_ai: gives the number of complete samples on which the information and the complexity have been computed. If the input dataset has no missing value, the number of samples is the same for all pairs and corresponds to the total number of samples.
- log_confidence: represents the info - cplx value. It is a way to quantify the strength of the edge (x, y).
- confidenceRatio: this column is present if the confidence cut is > 0 and it represents the ratio between the probability to reject the edge (x, y) in the dataset versus the mean probability to do the same in multiple (user defined) number of randomized datasets.
- infOrt: the orientation of the edge (x, y). It is the same value as in the adjacency matrix at row x and column y : 1 for unoriented, 2 for an edge from X to Y, -2 from Y to X and 6 for bidirectional.
- trueOrt: the orientation of the edge (x, y) present in the true edges file if provided.
- isOrtOk: indicates whether the inferred orientation is consistent with a reference graph (i.e. if a true edges file is provided). Y: the orientation is consistent; N: the orientation is not consistent with the PAG (Partial Ancestral Graph) derived from the given true graph.
- sign: the sign of the partial correlation between variables x and y, conditioned on the contributing nodes ai.
- partial_correlation: value of the partial correlation for the edge (x, y) conditioned on the contributing nodes ai.
- isCausal: details about the nature of the arrow tip for a directed edge. A directed edge in a causal graph does not necessarily imply causation but it does imply that the cause-effect relationship is not the other way around. An arrow-tip which is itself downstream of another directed edge suggests stronger causal sense and is marked by a 'Y', or 'N' otherwise.
- proba: probabilities for the inferred orientation, derived from the three-point mutual information (cf Affeldt & Isambert, UAI 2015 proceedings) and noted as p(x->y);p(x<-y).
retained.edges.summary: a data frame in the format of all.edges.summary containing only the inferred edges.
orientations.prob: this data frame lists the orientation probabilities of the two edges of all unshielded triples of the reconstructed network with the structure: node1 – mid-node – node2:
- node1: node at the end of the unshielded triplet
- p1: probability of the arrowhead node1 <- mid-node
- p2: probability of the arrowhead node1 -> mid-node
- mid-node: node at the center of the unshielded triplet
- p3: probability of the arrowhead mid-node <- node2
- p4: probability of the arrowhead mid-node -> node2
- node2: node at the end of the unshielded triplet
- NI3: 3 point (conditional) mutual information * N
AdjMatrix: the adjacency matrix is a square matrix used to represent the inferred graph. The entries of the matrix indicate whether pairs of vertices are adjacent or not in the graph. The matrix can be read as a (row, column) set of couples where the row represents the source node and the column the target node. Since miic can reconstruct mixed networks (including directed, undirected and bidirected edges), we will have a different digit for each case:
- 1: (x, y) edge is undirected
- 2: (x, y) edge is directed as x -> y
- -2: (x, y) edge is directed as x <- y
- 6: (x, y) edge is bidirected
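
The edge codes above can be decoded into a readable edge list with base R. The sketch below uses a hand-made 4-node matrix rather than real miic output. The helper `describe_edges()` is hypothetical, and storing the mirrored code (-2) at (y, x) for an edge coded 2 at (x, y) is an assumption. Only the upper triangle is read, so each pair is reported once.

```r
# Hand-made adjacency matrix using the codes 1, 2, -2 and 6.
nodes <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["A", "B"] <- 2; adj["B", "A"] <- -2  # A -> B
adj["B", "C"] <- 1; adj["C", "B"] <- 1   # B and C linked, undirected
adj["C", "D"] <- 6; adj["D", "C"] <- 6   # C <-> D (bidirected)

# Translate each non-zero entry of the upper triangle into "row arrow column".
describe_edges <- function(adj) {
  idx <- which(upper.tri(adj) & adj != 0, arr.ind = TRUE)
  arrows <- c("1" = "--", "2" = "->", "-2" = "<-", "6" = "<->")
  paste(rownames(adj)[idx[, "row"]],
        arrows[as.character(adj[idx])],
        colnames(adj)[idx[, "col"]])
}

describe_edges(adj)  # "A -> B"  "B -- C"  "C <-> D"
```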

References

Verny et al., PLoS Comp. Bio. 2017. https://doi.org/10.1371/journal.pcbi.1005662
Cabeli et al., PLoS Comp. Bio. 2020. https://doi.org/10.1371/journal.pcbi.1007866
Li et al., NeurIPS 2019. http://papers.nips.cc/paper/9573-constraint-based-causal-structure-learning-with-consistent-separating-sets.pdf

Examples

library(miic)

# EXAMPLE HEMATOPOIESIS
data(hematoData)

# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = hematoData[1:1000,], latent = "yes",
  n_shuffles = 10, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res, method="igraph")
}


# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).

miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE CANCER
data(cosmicCancer)
data(cosmicCancer_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = cosmicCancer, state_order = cosmicCancer_stateOrder, latent = "yes",
  n_shuffles = 100, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res)
}

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

# EXAMPLE OHNOLOGS
data(ohno)
data(ohno_stateOrder)
# execute MIIC (reconstruct graph)
miic.res <- miic(
  input_data = ohno, latent = "yes", state_order = ohno_stateOrder,
  n_shuffles = 100, conf_threshold = 0.001
)

# plot graph
if(require(igraph)) {
 plot(miic.res)
}

# write graph to graphml format. Note that to correctly visualize
# the network we created the miic style for Cytoscape (http://www.cytoscape.org/).
miic.write.network.cytoscape(g = miic.res, file = file.path(tempdir(), "temp"))

[Package miic version 1.5.3 Index]