R: The graphsim package

graphsim-package {graphsim}

R Documentation

The graphsim package

Description

graphsim is a package to simulate normalised expression data from networks for biological pathways using ‘igraph’ objects and multivariate normal distributions.

Details

This package provides functions to develop simulated continuous data (e.g., gene expression) from a Sigma (\Sigma) covariance matrix derived from a graph structure in ‘igraph’ objects. Intended to extend ‘mvtnorm’ to take 'igraph' structures rather than sigma matrices as input. This allows the use of simulated data that correctly accounts for pathway relationships and correlations. Here we present a versatile statistical framework to simulate correlated gene expression data from biological pathways, by sampling from a multivariate normal distribution derived from a graph structure. This package allows the simulation of biologicalpathways from a graph structure based on a statistical model of gene expression, such as simulation of expression profiles that of log-transformed and normalised data from microarray and RNA-Seq data.

Introduction

This package enables the generation of simulated gene expression datasets containing pathway relationships from a known underlying network. These simulated datasets can be used to evaluate various bioinformatics methodologies, including statistical and network inference procedures.

These are computed by 1) resolving inhibitory states to derive a consistent matrix of positive and negative edges, 2) inferring relationships between nodes from paths in the graph, 3) weighting these in a Sigma (\Sigma) covariance matrix and 4) using this to sample a multivariate normal distribution.

Getting Started

The generate_expression function is a wrapper around all necessary functions to give a final simulated dataset.

Here we set up an example graph object using the igraph package.

library("igraph")
graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"),
                               c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I"))
graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE)

Then we can call generate_expression to return the simulated data based on the relationships defined in the graph structure. Various options are available to fine-tune this.

expr <- generate_expression(100, graph_structure,
                            cor = 0.8,
                            mean = 0,
                            sd = 1,
                            comm = FALSE,
                            dist = TRUE,
                            absolute = FALSE,
                            laplacian = FALSE)

Here we can see the final result. The graph structure defines the covariance matrix used by rmvnorm to generate a multivariate distribution.

dim(expr)

library("gplots")
heatmap.2(expr,
          scale = "none",
          trace = "none",
          col = bluered(50),
          colsep = 1:4,
          rowsep = 1:4)

This dataset consists of 9 rows (one for each vertex or gene) in the graph and 100 columns (one for each sample or observation).

Input with an adjacency matrix is available using the generate_expression_mat function.

Creating Input Data

Graph structures can be passed directly from the igraph package. Using this package, you can create an ‘igraph’ class object.

> class(graph_structure)
[1] "igraph"

> graph_structure
IGRAPH ba7fa2f DN-- 9 8 -- 
  + attr: name (v/c)
  + edges from ba7fa2f (vertex names):
    [1] A->C B->C C->D D->E D->F F->G F->I H->I

This ‘igraph’ object class can be passed directly to generate_expression shown above and internal functions described below: make_sigma_mat_graph, make_sigma_mat_dist_graph, make_distance_graph, and make_state_matrix.

The ‘graphsim’ package also supports various matrix formats and has functions to handle these. The following functions will compute matrices from an ‘igraph’ object class:

make_adjmatrix_graph to derive the adjacency matrix for a graph structure.
make_commonlink_graph to derive the ‘common link’ matrix for a graph structure of mutually shared neighbours.
make_laplacian_graph to derive the Laplacian matrix for a graph structure.

The following functions will compute matrices from an adjacency matrix:

make_commonlink_adjmat to derive the ‘common link’ matrix for a graph structure of mutually shared neighbours.
make_laplacian_adjmat to derive the Laplacian matrix for a graph structure.

We provide some pre-generate pathways from Reactoem database for testing and demonstrations:

RAF_MAP_graph for the interactions in the “RAF/MAP kinase” cascade (17 vertices and 121 edges).
Pi3K_graph for the phosphoinositide-3-kinase cascade (35 vertices and 251 edges).
Pi3K_AKT_graph for the phosphoinositide-3-kinase activation of Protein kinase B pathway “PI3K/AKT activation” (275 vertices and 21106 edges).
TGFBeta_Smad_graph for the TGF-\beta receptor signaling activates SMADs pathway (32 vertices and 173 edges).

Please note that demonstrations on larger graph objects. These can be called directly from the pakage:

> graphsim::Pi3K_graph
IGRAPH 21437e3 DN-- 35 251 -- 
  + attr: name (v/c)
  + edges from 21437e3 (vertex names):
     [1] AKT1->AKT2  AKT1->AKT3  AKT1->CASP9 AKT1->CASP9
     [5] AKT1->CASP9 AKT1->FOXO1 AKT1->FOXO1 AKT1->FOXO1
     [9] AKT1->FOXO3 AKT1->FOXO3 AKT1->FOXO3 AKT1->FOXO4
     [13] AKT1->FOXO4 AKT1->FOXO4 AKT1->GSK3B AKT1->GSK3B
     [17] AKT1->GSK3B AKT1->NOS1  AKT1->NOS2  AKT1->NOS3 
     [21] AKT1->PDPK1 AKT2->AKT3  AKT2->CASP9 AKT2->CASP9
     [25] AKT2->CASP9 AKT2->FOXO1 AKT2->FOXO1 AKT2->FOXO1
     [29] AKT2->FOXO3 AKT2->FOXO3 AKT2->FOXO3 AKT2->FOXO4
     + ... omitted several edges
     + ... omitted several edges

They can also be imported into R:

data(Pi3K_graph)

You can assign them to your local environment by calling with from the package:

graph_object <- identity(Pi3K_graph)

You can also change the object class directly from the package:

library("igraph")
Pi3K_adjmat <- as_adjacency_matrix(Pi3K_graph)

Pi3K_AKT_graph and TGFBeta_Smad_graph contain graph edge attributes for the ‘state’ parameter described below.

 > TGFBeta_Smad_graph
 IGRAPH f3eac04 DN-- 32 173 -- 
   + attr: name (v/c), state (e/n)
   + edges from f3eac04 (vertex names):
     [1] BAMBI ->SMAD7  BAMBI ->TGFB1  BAMBI ->TGFBR1 BAMBI ->TGFBR2
     [5] CBL   ->NEDD8  CBL   ->NEDD8  CBL   ->TGFBR2 CBL   ->TGFBR2
     [9] CBL   ->UBE2M  CBL   ->UBE2M  FKBP1A->TGFB1  FKBP1A->TGFBR1
     [13] FKBP1A->TGFBR2 FURIN ->TGFB1  FURIN ->TGFB1  MTMR4 ->SMAD2 
     [17] MTMR4 ->SMAD2  MTMR4 ->SMAD3  MTMR4 ->SMAD3  NEDD4L->RPS27A
     [21] NEDD4L->SMAD7  NEDD4L->SMURF1 NEDD4L->SMURF2 NEDD4L->TGFB1 
     [25] NEDD4L->TGFBR1 NEDD4L->TGFBR2 NEDD4L->UBA52  NEDD4L->UBB   
     [29] NEDD4L->UBC    NEDD8 ->TGFBR2 NEDD8 ->UBE2M  PMEPA1->SMAD2 
     + ... omitted several edges
     
 > E(TGFBeta_Smad_graph)$state
 [1] 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [32] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [63] 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [94] 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 [125] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 [156] 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1
 
 > states <- E(TGFBeta_Smad_graph)$state
 > table(states)
 states
 1   2 
 103  70

Internal Functions

The following functions are used by generate_expression to compute a simulated dataset. They can be called separately to summarise the steps used to compute the final data matrix or for troubleshooting.

make_sigma_mat_adjmat, make_sigma_mat_comm, make_sigma_mat_laplacian, and make_sigma_mat_graph will compute a Sigma (\Sigma) covariance matrix from an adjacency matrix, common link matrix, Laplacian matrix, or an ‘igraph’ object. There are computed as above and passed to rmvnorm.
make_distance_adjmat, make_distance_comm, make_distance_laplacian, and make_distance_graph will compute a distance matrix of relationships from an adjacency matrix, common link matrix, Laplacian matrix, or an ‘igraph’ object. There are computed as above and passed to make_sigma.
make_state_matrix will compute a “state matrix” resolving positive and negative correlations from a vector of edge properties. This is called by make_sigma and generate_expression to ensure that the signs of correlations are consistent.

Examining Step-by-Step

These internal functions can be called to compute steps of the simulation procedure and examine the results.

1. first we create a graph structure and define the input parameters

library("igraph")
graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"),
                               c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I"))
graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE)
#sample size
data.n <- 100
#data distributions
data.cor <- 0.75
data.mean <- 3
data.sd <- 1.5
#inhibition states
edge_states <- c(1, 1, -1, -1, 1, 1, 1, 1)

2. examine the relationships between the genes.

Here we can see which nodes share an edge:

> adjacency_matrix <- make_adjmatrix_graph(graph_structure)
> adjacency_matrix
  A C B D E F G I H
A 0 1 0 0 0 0 0 0 0
C 1 0 1 1 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0
D 0 1 0 0 1 1 0 0 0
E 0 0 0 1 0 0 0 0 0
F 0 0 0 1 0 0 1 1 0
G 0 0 0 0 0 1 0 0 0
I 0 0 0 0 0 1 0 0 1
H 0 0 0 0 0 0 0 1 0

Here we define a geometrically decreasing series of relationships between genes based on distance by paths in the graph:

> relationship_matrix <- make_distance_graph(graph_structure, absolute = FALSE)
> relationship_matrix
  A          C          B          D          E          F          G          I          H
A 1.00000000 0.20000000 0.10000000 0.10000000 0.06666667 0.06666667 0.05000000 0.05000000 0.04000000
C 0.20000000 1.00000000 0.20000000 0.20000000 0.10000000 0.10000000 0.06666667 0.06666667 0.05000000
B 0.10000000 0.20000000 1.00000000 0.10000000 0.06666667 0.06666667 0.05000000 0.05000000 0.04000000
D 0.10000000 0.20000000 0.10000000 1.00000000 0.20000000 0.20000000 0.10000000 0.10000000 0.06666667
E 0.06666667 0.10000000 0.06666667 0.20000000 1.00000000 0.10000000 0.06666667 0.06666667 0.05000000
F 0.06666667 0.10000000 0.06666667 0.20000000 0.10000000 1.00000000 0.20000000 0.20000000 0.10000000
G 0.05000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 1.00000000 0.10000000 0.06666667
I 0.05000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 0.10000000 1.00000000 0.20000000
H 0.04000000 0.05000000 0.04000000 0.06666667 0.05000000 0.10000000 0.06666667 0.20000000 1.00000000

Here we can see the resolved edge states through paths in the adjacency matrix:

> names(edge_states) <- apply(graph_structure_edges, 1, paste, collapse = "-")
> edge_states
A-C B-C C-D D-E D-F F-G F-I H-I 
1   1  -1  -1   1   1   1   1 
> state_matrix <- make_state_matrix(graph_structure, state = edge_states)
> state_matrix
   A  C  B  D  E  F  G  I  H
A  1  1  1 -1  1 -1 -1 -1 -1
C  1  1  1 -1  1 -1 -1 -1 -1
B  1  1  1 -1  1 -1 -1 -1 -1
D -1 -1 -1  1 -1  1  1  1  1
E  1  1  1 -1  1 -1 -1 -1 -1
F -1 -1 -1  1 -1  1  1  1  1
G -1 -1 -1  1 -1  1  1  1  1
I -1 -1 -1  1 -1  1  1  1  1
H -1 -1 -1  1 -1  1  1  1  1

3. define a Sigma (\Sigma) covariance matrix

Here we can see that the signs match the state_matrix and the covariance is based on the relationship_matrix weighted by the correlation (cor) and standard deviation (sd) parameters.

Note that where sd = 1, the diagonals will be 1 and the off-diagonal terms will be correlations.

> sigma_matrix <- make_sigma_mat_dist_graph(
+     graph_structure,
+     state = edge_states,
+     cor = data.cor,
+     sd = data.sd,
+     absolute = FALSE
+ )
> sigma_matrix
   A         C         B        D         E        F         G         I         H
A  2.250000  1.687500  0.843750 -0.84375  0.562500 -0.56250 -0.421875 -0.421875 -0.337500
C  1.687500  2.250000  1.687500 -1.68750  0.843750 -0.84375 -0.562500 -0.562500 -0.421875
B  0.843750  1.687500  2.250000 -0.84375  0.562500 -0.56250 -0.421875 -0.421875 -0.337500
D -0.843750 -1.687500 -0.843750  2.25000 -1.687500  1.68750  0.843750  0.843750  0.562500
E  0.562500  0.843750  0.562500 -1.68750  2.250000 -0.84375 -0.562500 -0.562500 -0.421875
F -0.562500 -0.843750 -0.562500  1.68750 -0.843750  2.25000  1.687500  1.687500  0.843750
G -0.421875 -0.562500 -0.421875  0.84375 -0.562500  1.68750  2.250000  0.843750  0.562500
I -0.421875 -0.562500 -0.421875  0.84375 -0.562500  1.68750  0.843750  2.250000  1.687500
H -0.337500 -0.421875 -0.337500  0.56250 -0.421875  0.84375  0.562500  1.687500  2.250000

4. generate an expression dataset using this sigma matrix

We use generate_expression to compute and expression dataset, simulated using these parameters:

> expression_data <- generate_expression(
+     n = data.n,
+     graph_structure,
+     state = edge_states,
+     cor = data.cor,
+     mean = data.mean,
+     sd = data.sd,
+     comm = FALSE,
+     dist = FALSE,
+     absolute = FALSE,
+     laplacian = FALSE
+ )
> dim(expression_data)
[1]   9 100

Here we also compute the final observed correlations in the simulated dataset:

> cor_data <- cor(t(expression_data))
> dim(cor_data)
[1] 9 9

These functions are demonstrated in more detail in the main vignette.

Data Visualization

Heatmaps can be used from the gplots package to display these simulated datasets.

library("gplots")
heatmap.2(adjacency_matrix, scale = "none", trace = "none",
          col = colorpanel(50, "white", "black"), key = FALSE)
          
heatmap.2(relationship_matrix, scale = "none", trace = "none",
          col = colorpanel(50, "white", "red"))
          
heatmap.2(state_matrix, scale = "none", trace = "none",
          col = colorpanel(50, "royalblue", "palevioletred"),
          colsep = 1:length(V(graph_structure)),
          rowsep = 1:length(V(graph_structure)))

heatmap.2(sigma_matrix, scale = "none", trace = "none",
          col = colorpanel(50, "royalblue", "white", "palevioletred"),
          colsep = 1:length(V(graph_structure)),
          rowsep = 1:length(V(graph_structure)))
          
heatmap.2(expression_data, scale = "none", trace = "none",
          col = colorpanel(50, "royalblue", "white", "palevioletred"),
          colsep = 1:length(V(graph_structure)),
         rowsep = 1:length(V(graph_structure)))

heatmap.2(cor_data, scale = "none", trace = "none",
          col = colorpanel(50, "royalblue", "white", "palevioletred"),
          colsep = 1:length(V(graph_structure)),
          rowsep = 1:length(V(graph_structure)))

In particular we can see here that the expected correlations show by the sigma_matrix are similar to the observed correlations in the cor_data.

Graph Visualization

The ‘graphsim’ package comes with a built-in plotting function to display graph objects.

graph_structure_edges <- rbind(c("A", "C"), c("B", "C"), c("C", "D"),c("D", "E"),
                               c("D", "F"), c("F", "G"), c("F", "I"), c("H", "I"))
graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE)
plot_directed(graph_structure, layout = layout.kamada.kawai)

This supports the ‘state’ parameter to display activating relationships (with positive correlations) and inhibiting or repressive relationships (with negative correlations).

edge_states <- c(1, 1, -1, -1, 1, -1, 1, -1)
graph_structure <- graph.edgelist(graph_structure_edges, directed = TRUE)
plot_directed(graph_structure, state = edge_states,
              col.arrow = c("darkgreen", "red")[edge_states / 2 + 1.5]
              layout = layout.kamada.kawai)

These states can also be passed from the ‘state’ edge attribute of the graph object.

graph_pathway <- identity(TGFBeta_Smad_graph)
edge_properties <- E(graph_pathway)$state
plot_directed(graph_pathway,
              col.arrow = c(alpha("navyblue", 0.25),
                            alpha("red", 0.25))[edge_properties],
              fill.node = c("lightblue"),
              layout = layout.kamada.kawai)

This plotting function is demonstrated in more detail in the plots_directed.Rmd plotting vignette.

Further information

The graphsim package is published in the Journal of Open Source Software. See the paper here for more details: doi:10.21105/joss.02161

The graphsim GitHub repository is here: TomKellyGenetics/graphsim You can find the development version and submit an issue if you have questions or comments.

Citation

To cite package 'graphsim' in publications use:

Kelly, S.T. and Black, M.A. (2020). graphsim: An R package for simulating gene expression data from graph structures of biological pathways. Journal of Open Source Software, 5(51), 2161, doi:10.21105/joss.02161

A BibTeX entry for LaTeX users is:

  @article{Kelly2020joss02161,
     doi = {10.21105/joss.02161},
     year = {2020},
     publisher = {The Open Journal},
     volume = {5},
     number = {51},
     pages = {2161},
     author = {S. Thomas Kelly and Michael A. Black},
     title = {graphsim: An R package for simulating gene expression data from graph structures of biological pathways},
     journal = {Journal of Open Source Software} 
   }

Author(s)

Maintainer: Tom Kelly tom.kelly@riken.jp

Authors:

Tom Kelly (RIKEN IMS) ORCID)
Mik Black (Otago University) (ORCID)

Reviewers:

Cory Brunson (UConn) (ORCID)
Robrecht Cannoodt (Ghent University) (ORCID)

Editor: Mark Jensen (Frederick National Laboratory for Cancer Research)