modulePreservation {NetRep} | R Documentation |
Replication and preservation of network modules across datasets
Description
Quantify the preservation of network modules (sub-graphs) in an independent dataset through permutation testing on module topology. Seven network statistics (see details) are calculated for each module and then tested by comparing to distributions generated from their calculation on random subsets in the test dataset.
Usage
modulePreservation(
network,
data,
correlation,
moduleAssignments,
modules = NULL,
backgroundLabel = "0",
discovery = 1,
test = 2,
selfPreservation = FALSE,
nThreads = NULL,
nPerm = NULL,
null = "overlap",
alternative = "greater",
simplify = TRUE,
verbose = TRUE
)
Arguments
network |
a list of interaction networks, one for each dataset. Each
entry of the list should be a |
data |
a list of matrices, one for each dataset. Each entry of the list
should be the data used to infer the interaction |
correlation |
a list of matrices, one for each dataset. Each entry of
the list should be a |
moduleAssignments |
a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset. |
modules |
a list of vectors, one for each |
backgroundLabel |
a single label given to nodes that do not belong to
any module in the |
discovery |
a vector of names or indices denoting the discovery
dataset(s) in the |
test |
a list of vectors, one for each |
selfPreservation |
logical; if |
nThreads |
number of threads to parallelise the calculation of network properties over. Automatically determined as the number of cores - 1 if not specified. |
nPerm |
number of permutations to use. If not specified, the number of permutations will be automatically determined (see details). When set to 0 the permutation procedure will be skipped and the observed module preservation will be returned without p-values. |
null |
variables to include when generating the null distributions. Must be either "overlap" or "all" (see details). |
alternative |
The type of module preservation test to perform. Must be one of "greater" (default), "less" or "two.sided" (see details). |
simplify |
logical; if |
verbose |
logical; should progress be reported? Default is |
Details
Input data structures:
The preservation of network modules in a second dataset is quantified by
measuring the preservation of topological properties between the
discovery and test datasets. These properties are calculated
not only from the interaction networks inferred in each dataset, but also
from the data used to infer those networks (e.g. gene expression data) as
well as the correlation structure between variables/nodes. Thus, all
functions in the NetRep
package have the following arguments:
network
: a list of interaction networks, one for each dataset.data
: a list of data matrices used to infer those networks, one for each dataset.correlation
: a list of matrices containing the pairwise correlation coefficients between variables/nodes in each dataset.moduleAssignments
: a list of vectors, one for each discovery dataset, containing the module assignments for each node in that dataset.modules
: a list of vectors, one for each discovery dataset, containing the names of the modules from that dataset to analyse.discovery
: a vector indicating the names or indices of the previous arguments' lists to use as the discovery dataset(s) for the analyses.test
: a list of vectors, one vector for each discovery dataset, containing the names or indices of thenetwork
,data
, andcorrelation
argument lists to use as the test dataset(s) for the analysis of each discovery dataset.
The formatting of these arguments is not strict: each function will attempt
to make sense of the user input. For example, if there is only one
discovery
dataset, then input to the moduleAssigments
and
test
arguments may be vectors, rather than lists.
Analysing large datasets:
Matrices in the network
, data
, and correlation
lists
can be supplied as disk.matrix
objects. This class allows
matrix data to be kept on disk and loaded as required by NetRep.
This dramatically decreases memory usage: the matrices for only one
dataset will be kept in RAM at any point in time.
Additional memory usage of the permutation procedure is directly proportional to the sum of module sizes squared multiplied by the number of threads. Very large modules may result in significant additional memory usage per core due to extraction of the correlation coefficient sub-matrix at each permutation.
Module Preservation Statistics:
Module preservation is assessed through seven module preservation statistics, each of which captures a different aspect of a module's topology; i.e. the structure of the relationships between its nodes (1,2). Below is a description of each statistic, what they seek to measure, and where their interpretation may be inappropriate.
The module coherence ('coherence'
), average node
contribution ('avg.contrib'
), and concordance of node
contribution ('cor.contrib'
) are all calculated from the data used
to infer the network (provided in the 'data'
argument). They are
calculated from the module's summary profile. This is the eigenvector
of the 1st principal component across all observations for every node
composing the module. For gene coexpression modules this can be interpreted
as a "summary expression profile". It is typically referred to as the
"module eigengene" in the weighted gene coexpression network analysis
literature (4).
The module coherence ('coherence'
) quantifies the proportion
of module variance explained by the module's "summary profile". The higher
this value, the more "coherent" the data is, i.e. the more similar
the observations are nodes for each sample. With the default alternate
hypothesis, a small permutation P-value indicates that the module is
more coherent than expected by chance.
The average node contribution ('avg.contrib'
) and
concordance of node contribution ('cor.contrib'
) are calculated
from the node contribution, which quantifies how similar each node is
to the modules's summary profile. It is calculated as the Pearson
correlation coefficient between each node and the module summary profile. In
the weighted gene coexpression network literature it is typically called the
"module membership" (2).
The average node contribution ('avg.contrib'
) quantifies how
similar nodes are to the module summary profile in the test dataset. Nodes
detract from this score where the sign of their node contribution flips
between the discovery and test datasets, e.g. in the case of
differential gene expression across conditions. A high average node
contribution with a small permutation P-value indicates that the
module remains coherent in the test dataset, and that the nodes are acting
together in a similar way.
The concordance of node contribution ('cor.contrib'
) measures
whether the relative rank of nodes (in terms of their node contribution) is
preserved across datasets. If a module is coherent enough that all nodes
contribute strongly, then this statistic will not be meaningful as its value
will be heavily influenced by tiny variations in node rank. This can be
assessed through visualisation of the module topology (see
plotContribution
.) Similarly, a strong
'cor.contrib'
is unlikely to be meaningful if the
'avg.contrib'
is not significant.
The concordance of correlation strucutre ('cor.cor'
) and
density of correlation structure ('avg.cor'
) are calculated
from the user-provided correlation structure between nodes (provided in the
'correlation'
argument). This is referred to as "coexpression" when
calculated on gene expression data.
The 'avg.cor'
measures how strongly nodes within a module are
correlation on average in the test dataset. This average depends on the
correlation coefficients in the discovery dataset: the score is penalised
where correlation coefficients change in sign between datasets. A high
'avg.cor'
with a small permutation P-value indicates that the
module is (a) more strongly correlated than expected by chance for a module
of the same size, and (b) more consistently correlated with respect to the
discovery dataset than expected by chance.
The 'cor.cor'
measures how similar the correlation coefficients are
across the two datasets. A high 'cor.cor'
with a small permutation
P-value indicates that the correlation structure within a module is
more similar across datasets than expected by chance. If all nodes within a
module are very similarly correlated then this statistic will not be
meaningful, as its value will be heavily influenced by tiny, non-meaningful,
variations in correlation strength. This can be assessed through
visualisation of the module topology (see plotCorrelation
.)
Similarly, a strong 'cor.cor'
is unlikely to be meaningful if the
'avg.cor'
is not significant.
The average edge weight ('avg.weight'
) and concordance
of weighted degree ('cor.degree'
) are both calculated from the
interaction network (provided as adjacency matrices to the 'network'
argument).
The 'avg.weight'
measures the average connection strength between
nodes in the test dataset. In the weighted gene coexpression network
literature this is typically called the "module density" (2). A high
'avg.weight'
with a small permutation P-value indicates that
the module is more strongly connected in the test dataset than expected by
chance.
The 'cor.degree'
calculates whether the relative rank of each node's
weighted degree is similar across datasets. The weighted
degree is calculated as the sum of a node's edge weights to all other nodes
in the module. In the weighted gene coexpression network literature this is
typically called the "intramodular connectivity" (2). This statistic
will not be meaningful where all nodes are connected to each other with
similar strength, as its value will be heavily influenced by tiny,
non-meaningful, variations in weighted degree. This can be assessed through
visualisation of the module topology (see plotDegree
.)
Both the 'avg.weight'
and 'cor.degree'
assume edges are
weighted, and that the network is densely connected. Note that for sparse
networks, edges with zero weight are included when calculating both
statistics. Only the magnitude of the weights, not their sign, contribute to
the score. If the network is unweighted, i.e. edges indicate
presence or absence of a relationship, then the 'avg.weight'
will be
the proportion of the number of edges to the total number of possible edges
while the weighted degree simply becomes the degree. A high
'avg.weight'
in this case measures how interconnected a module is in
the test dataset. A high degree indicates that a node is connected to
many other nodes. The interpretation of the 'cor.degree'
remains
unchanged between weighted and unweighted networks. If the network is
directed the interpretation of the 'avg.weight'
remains unchanged,
while the cor.degree will measure the concordance of the node
in-degree in the test network. To measure the out-degree
instead, the adjacency matrices provided to the 'network'
argument
should be transposed.
Sparse data:
Caution should be used when running NetRep
on sparse data (i.e. where there are many zero values in the data
used to infer the network). For this data, the average node contribution
('avg.contrib'
), concordance of node contribution
('cor.contrib'
), and module coherence ('coherence'
)
will all be systematically underestimated due to their reliance on the
Pearson correlation coefficient to calculate the node contribution.
Care should also be taken to use appropriate methods for inferring the correlation structure when the data is sparse for the same reason.
Proportional data:
Caution should be used when running NetRep
on proportional data (
i.e. where observations across samples all sum to the same value,
e.g. 1). For this data, the average node contribution
('avg.contrib'
), concordance of node contribution
('cor.contrib'
), and module coherence ('coherence'
)
will all be systematically overestimated due to their reliance on the
Pearson correlation coefficient to calculate the node contribution.
Care should also be taken to use appropriate methods for inferring the correlation structure from proportional data for the same reason.
Hypothesis testing:
Three alternative hypotheses are available. "greater", the default, tests whether each module preservation statistic is larger than expected by chance. "lesser" tests whether each module preservation statistic is smaller than expected by chance, which may be useful for identifying modules that are extremely different in the test dataset. "two.sided" can be used to test both alternate hypotheses.
To determine whether a module preservation statistic deviates from chance, a
permutation procedure is employed. Each statistic is calculated between the
module in the discovery dataset and nPerm
random subsets of
the same size in the test dataset in order to assess the distribution
of each statistic under the null hypothesis.
Two models for the null hypothesis are available: "overlap", the default, only nodes that are present in both the discovery and test networks are used when generating null distributions. This is appropriate under an assumption that nodes that are present in the test dataset, but not present in the discovery dataset, are unobserved: that is, they may fall in the module(s) of interest in the discovery dataset if they were to be measured there. Alternatively, "all" will use all nodes in the test network when generating the null distributions.
The number of permutations required for any given significance threshold is
approximately 1 / the desired significance for one sided tests, and double
that for two-sided tests. This can be calculated with
requiredPerms
. When nPerm
is not specified, the number
of permutations is automatically calculated as the number required for a
Bonferroni corrected significance threshold adjusting for the total number
of tests for each statistic, i.e. the total number of modules to be analysed
multiplied by the number of test datasets each module is tested in.
Although assessing the replication of a small numberof modules calls for
very few permutations, we recommend using no fewer than 1,000 as fewer
permutations are unlikely to generate representative null distributions.
Note: the assumption used by requiredPerms
to
determine the correct number of permtutations breaks down when assessing the
preservation of modules in a very small dataset (e.g. gene sets in a dataset
with less than 100 genes total). However, the reported p-values will still
be accurate (see permutationTest
) (3).
Value
A nested list structure. At the top level, the list has one element per
'discovery'
dataset. Each of these elements is a list that has one
element per 'test'
dataset analysed for that 'discovery'
dataset. Each of these elements is also a list, containing the following
objects:
observed
: A matrix of the observed values for the module preservation statistics. Rows correspond to modules, and columns to the module preservation statistics.nulls
: A three dimensional array containing the values of the module preservation statistics evaluated on random permutation of module assignment in the test network. Rows correspond to modules, columns to the module preservation statistics, and the third dimension to the permutations.p.values
: A matrix of p-values for theobserved
module preservation statistics as evaluated through a permutation test using the corresponding values innulls
.nVarsPresent
: A vector containing the number of variables that are present in the test dataset for each module.propVarsPresent
: A vector containing the proportion of variables present in the test dataset for each module. Modules where this is less than 1 should be investigated further before making judgements about preservation to ensure that the missing variables are not the most connected ones.contingency
: IfmoduleAssignments
are present for both the discovery and test datasets, then a contingency table showing the overlap between modules across datasets is returned. Rows correspond to modules in the discovery dataset, columns to modules in the test dataset.
When simplify = TRUE
then the simplest possible structure will be
returned. E.g. if module preservation is tested in only one dataset, then
the returned list will have only the above elements.
When simplify = FALSE
then a nested list of datasets will always be
returned, i.e. each element at the top level and second level correspond to
a dataset, e.g. results[["Dataset1"]][["Dataset2"]]
indicates an
analysis where modules discovered in "Dataset1" are assessed for
preservation in "Dataset2". Dataset comparisons which have not been
assessed will contain NULL
.
References
-
Ritchie, S.C., et al., A scalable permutation approach reveals replication and preservation patterns of network modules in large datasets. Cell Systems. 3, 71-82 (2016).
-
Langfelder, P., Luo, R., Oldham, M. C. & Horvath, S. Is my network module preserved and reproducible? PLoS Comput. Biol. 7, e1001057 (2011).
-
Phipson, B. & Smyth, G. K. Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 9, Article39 (2010).
-
Langfelder, P. & Horvath, S. WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics 9, 559 (2008).
See Also
Functions for: visualising network modules, calculating module topology, calculating permutation test P-values, and splitting computation over multiple machines.
Examples
# load in example data, correlation, and network matrices for a discovery and test dataset:
data("NetRep")
# Set up input lists for each input matrix type across datasets. The list
# elements can have any names, so long as they are consistent between the
# inputs.
network_list <- list(discovery=discovery_network, test=test_network)
data_list <- list(discovery=discovery_data, test=test_data)
correlation_list <- list(discovery=discovery_correlation, test=test_correlation)
labels_list <- list(discovery=module_labels)
# Assess module preservation: you should run at least 10,000 permutations
preservation <- modulePreservation(
network=network_list, data=data_list, correlation=correlation_list,
moduleAssignments=labels_list, nPerm=1000, discovery="discovery",
test="test", nThreads=2
)