classify {netcom} | R Documentation |
Mechanistic Network Classification
Description
Tests a network against hypothetical generating processes using comparative network inference.
Usage
classify(
network,
directed,
method = "DD",
net_kind = "matrix",
mechanism_kind = "canonical",
DD_kind = c("in", "out", "entropy_in", "entropy_out", "clustering_coefficient",
"page_rank", "communities", "motifs_3", "motifs_4", "eq_in", "eq_out",
"eq_entropy_in", "eq_entropy_out", "eq_clustering_coefficient", "eq_page_rank",
"eq_communities", "eq_motifs_3", "eq_motifs_4"),
DD_weight = c(0.0735367966, 0.0739940162, 0.0714523761, 0.0708156931, 0.0601296752,
0.0448072016, 0.0249793608, 0.0733125084, 0.0697029389, 0.0504358835, 0.0004016029,
0.0563752664, 0.0561878218, 0.0540490099, 0.0504347104, 0.0558106667, 0.0568270319,
0.0567474398),
cause_orientation = "row",
max_norm = FALSE,
resolution = 100,
resolution_min = 0.01,
resolution_max = 0.99,
reps = 3,
processes = c("ER", "PA", "DM", "SW", "NM"),
test = "empirical",
best_fit_finder = "systematic",
power_max = 5,
connectance_max = 0.5,
divergence_max = 0.5,
mutation_max = 0.5,
null_reps = 50,
best_fit_kind = "avg",
best_fit_sd = 0,
ks_dither = 0,
ks_alternative = "two.sided",
cores = 1,
size_different = FALSE,
null_dist_trim = 1,
verbose = FALSE
)
Arguments
network |
The network to be classified. |
directed |
Whether the target network is directed. If missing, this is inferred from the symmetry of the input network. |
method |
This determines the method used to compare networks at the heart of the classification. Currently "DD" (Degree Distribution) and "align" (the align function which compares networks by the entropy of diffusion on them) are supported. Future versions will allow user-defined methods. Defaults to "DD". |
net_kind |
Whether the network is an adjacency matrix ("matrix") or an edge list ("list"). Defaults to "matrix". |
mechanism_kind |
Either "canonical" or "grow" can be used to simulate networks. If "grow" is used, note that here it will only simulate pure mixtures made of a single mechanism. Defaults to "canonical". |
DD_kind |
A vector of network properties used to compare networks. Defaults to the 18 properties listed in Usage. |
DD_weight |
Weights of each network property in DD_kind. Defaults to the 18 weights listed in Usage, one for each property in the default DD_kind. |
cause_orientation |
The orientation of directed adjacency matrices. Defaults to "row". |
max_norm |
Binary variable indicating if each network property should be normalized so its max value (if a node-level property) is one. Defaults to FALSE. |
resolution |
Defaults to 100. The first step is to find the version of each process most similar to the target network. This parameter sets the number of parameter values to search across. Decrease to improve performance, but at the cost of accuracy. |
resolution_min |
Defaults to 0.01. The minimum parameter value to consider. Zero is not used because in many processes it results in degenerate systems (e.g. entirely unconnected networks). Currently process agnostic. Future versions will accept a vector of values, one for each process. |
resolution_max |
Defaults to 0.99. The maximum parameter value to consider. One is not used because in many processes it results in degenerate systems (e.g. entirely connected networks). Currently process agnostic. Future versions will accept a vector of values, one for each process. |
reps |
Defaults to 3. The number of networks to simulate for each parameter. More replicates increase accuracy by making the estimate of the parameter that produces networks most similar to the target network less idiosyncratic. |
processes |
Defaults to c("ER", "PA", "DM", "SW", "NM"). Vector of process abbreviations. Currently only the default five are supported. Future versions will accept user-defined network-generating functions and associated parameters. ER = Erdos-Renyi random. PA = Preferential Attachment. DM = Duplication and Mutation. SW = Small World. NM = Niche Model. |
test |
Defaults to "empirical". The test used to determine whether comparisons involving the network being classified are distinguishable from the null distribution of comparisons among networks simulated according to the hypothesized mechanism(s) with its best-fitting parameter. "empirical" finds how many simulated networks were on average farther from each other than the network being classified is from them. "KS" uses a Kolmogorov-Smirnov test. "WMWU" uses a Wilcoxon-Mann-Whitney-U test. |
best_fit_finder |
Defaults to "systematic". Determines how the best-fitting parameter of each mechanism specified in processes is found. "systematic" tries every parameter value from resolution_min to resolution_max with a step size of (resolution_max - resolution_min) / resolution. "optim_L-BFGS-B" uses the L-BFGS-B optimizer in the optimx package. "optim_GenSA" uses the GenSA optimizer in the GenSA package. |
power_max |
Defaults to 5. The maximum power of attachment in the Preferential Attachment process (PA). |
connectance_max |
Defaults to 0.5. The maximum connectance parameter for the Niche Model. |
divergence_max |
Defaults to 0.5. The maximum divergence parameter for the Duplication and Divergence/Mutation mechanisms. |
mutation_max |
Defaults to 0.5. The maximum mutation parameter for the Duplication and Mutation mechanism. |
null_reps |
Defaults to 50. The number of best fit networks to simulate that will be used to create a null distribution of distances between networks within the given process, which will then be used to test if the target network appears unusually distant from them and therefore likely not governed by that process. |
best_fit_kind |
Defaults to "avg". If null_reps is more than 1, the fit of each parameter has to be an aggregate statistic of the fit of all the null_reps networks. Must be 'avg', 'median', 'min', or 'max'. |
best_fit_sd |
Defaults to 0. Standard Deviation used to simulate networks with a similar but not identical best fit parameter. This is important because simulating networks with the identical parameter can artificially inflate the false negative rate by assuming the best fit parameter is the true parameter. For large resolution and reps values this will become true, but can be computationally intractable for realistically large systems. |
ks_dither |
Defaults to 0. The KS test cannot compute exact p-values when pairwise network distances are tied. Adding a small amount of noise makes each distance unique. We are not aware of a study of the impact this has on accuracy, so it is set to zero by default. |
ks_alternative |
Defaults to "two.sided". Governs the KS test. Assuming best_fit_sd is not too large, this can be set to "greater" because the target network cannot be more similar to identically simulated networks than they are to each other. In practice, however, we have found that "greater" and "less" produce numerical errors. Only "two.sided", "less", and "greater" are supported, through stats::ks.test(). |
cores |
Defaults to 1. The number of cores to run the classification on. When set to 1 parallelization will be ignored. |
size_different |
Whether the networks used in the null distribution differ in size. Defaults to FALSE. |
null_dist_trim |
Number between zero and one that determines how much of each network comparison distribution (unknown network compared to simulated networks, simulated networks compared to each other) should be used. Prevents p-value convergence with large sample sizes. Defaults to 1, which means all comparisons are used (no trimming). |
verbose |
Defaults to FALSE. Whether to print all messages. |
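The "empirical" option of the test argument can be pictured with a small base R sketch. This is only an illustration of the idea as described above, not netcom's internal code: each network is replaced by a hypothetical one-number summary, distance is the absolute difference between summaries, and the names simulated, target, null_distances, and target_distance are assumptions made for the example.

```r
set.seed(42)
null_reps <- 50

# Hypothetical one-number summaries standing in for whole networks
simulated <- rnorm(null_reps, mean = 0, sd = 1)
target <- 0.5

# Average distance from each simulated network to the other simulated networks
null_distances <- sapply(seq_len(null_reps), function(i) {
  mean(abs(simulated[i] - simulated[-i]))
})

# Average distance from the target to the simulated networks
target_distance <- mean(abs(target - simulated))

# Share of simulated networks at least as far from their peers as the target is
p_value <- mean(null_distances >= target_distance)
```

A small p_value here would mean the target sits unusually far from the simulated networks relative to how far they sit from each other.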
Details
Note: Currently each process is assumed to have a single governing parameter.
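Because each process has a single governing parameter, the "systematic" search amounts to scanning a one-dimensional grid. The following base R sketch shows that idea under stated assumptions: the grid is taken to be evenly spaced with resolution values between resolution_min and resolution_max, and toy_distance is a hypothetical stand-in for simulating a network and comparing it to the target. It is not the package's implementation.

```r
# Sketch only: an evenly spaced parameter grid, assuming `resolution` is the
# number of values searched
resolution_min <- 0.01
resolution_max <- 0.99
resolution <- 100
reps <- 3

parameters <- seq(from = resolution_min, to = resolution_max, length.out = resolution)

# Hypothetical stand-in for "simulate a network with this parameter and
# measure its distance to the target"
toy_distance <- function(parameter, target = 0.3) {
  abs(parameter - target) + rnorm(1, sd = 0.01)
}

set.seed(1)
# Average the distance over `reps` replicates for each candidate parameter
fit <- sapply(parameters, function(p) mean(replicate(reps, toy_distance(p))))

# The best-fitting parameter minimizes the average distance
best_parameter <- parameters[which.min(fit)]
```

Lowering resolution shortens the grid and so the run time, at the cost of a coarser estimate of best_parameter.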
Value
A data frame with 3 columns and as many rows as processes being tested (5 by default). The first column lists the processes. The second gives the p-value for the null hypothesis that the target network came from that row's process. The third gives the estimated parameter for that process.
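One way to read such a result is to keep the processes whose null hypothesis is not rejected. The sketch below uses a mock data frame: the column names process, p_value, and par_value are assumptions, and every number is a placeholder invented for illustration rather than real output from classify().

```r
# Mock result with the three-column shape described above.
# Column names and all values are placeholders, not real output.
result <- data.frame(
  process = c("ER", "PA", "DM", "SW", "NM"),
  p_value = c(0.40, 0.02, 0.10, 0.03, 0.01),
  par_value = c(0.50, 1.20, 0.30, 0.10, 0.20)
)

alpha <- 0.05
# Processes that cannot be rejected at level alpha
not_rejected <- result[result$p_value >= alpha, ]

# Order the remaining candidates from largest to smallest p-value
not_rejected <- not_rejected[order(-not_rejected$p_value), ]
```

Since the null hypothesis is that the network did come from the process, a small p-value counts against that process.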
References
Langendorf, R. E., & Burgess, M. G. (2020). Empirically Classifying Network Mechanisms. arXiv preprint arXiv:2012.15863.
Examples
# Import netcom
library(netcom)
# Adjacency matrix
size <- 10
network <- matrix(sample(c(0,1), size = size^2, replace = TRUE), nrow = size, ncol = size)
# Classify this network
# This can take several minutes to run
classify(network, processes = c("ER", "PA", "DM", "SW", "NM"))
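Input preparation can be sketched in base R as well. The example below builds an adjacency matrix from a small hypothetical edge list and checks its symmetry, which is the kind of check the directed argument describes when it is left missing. The edge list and the variable names are made up for illustration, and this is not necessarily how netcom performs either step internally.

```r
# Hypothetical edge list: each row is one link from column 1 to column 2
edges <- matrix(c(1, 2,
                  2, 3,
                  3, 1,
                  1, 4), ncol = 2, byrow = TRUE)

# Build a square adjacency matrix with one row and column per node
size <- max(edges)
adjacency <- matrix(0, nrow = size, ncol = size)

# Indexing by a two-column matrix treats each row as a (row, column) pair
adjacency[edges] <- 1

# A symmetric matrix suggests an undirected network
directed <- !isSymmetric(adjacency)
```

The resulting adjacency matrix has the same form as the network object in the example above.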