NetGSA {netgsa}    R Documentation

Network-based Gene Set Analysis

Description

Tests the significance of pre-defined sets of genes (pathways) with respect to an outcome variable, such as the condition indicator (e.g. cancer vs. normal, etc.), based on the underlying biological networks.

Usage

NetGSA(A, x, group, pathways, lklMethod = "REHE", 
       sampling=FALSE, sample_n = NULL, sample_p = NULL, minsize=5, 
       eta = 0.1, lim4kappa = 500)

Arguments

A

A list of weighted adjacency matrices. Typically returned from prepareAdjMat

x

The $p \times n$ data matrix with rows referring to genes and columns to samples. It is very important that the adjacency matrices A share the same rownames as the data matrix x.

group

Vector of class indicators of length $n$.

pathways

The npath by $p$ indicator matrix for pathways.

lklMethod

Method used for variance component calculation: options are ML (maximum likelihood), REML (restricted maximum likelihood), HE (Haseman-Elston regression) or REHE (restricted Haseman-Elston regression). See details.

sampling

(Logical) whether to subsample the observations and/or variables. See details.

sample_n

The ratio for subsampling the observations if sampling=TRUE.

sample_p

The ratio for subsampling the variables if sampling=TRUE.

minsize

Minimum number of genes in pathways to be considered.

eta

Approximation limit for the Influence matrix. See 'Details'.

lim4kappa

Limit for condition number (used to adjust eta). See 'Details'.
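The requirements on the arguments above can be checked before calling NetGSA. The following is a minimal base R sketch, assuming A is a flat list of matrices; the gene names, sample names and toy matrices are hypothetical and only illustrate the rowname and length checks.

```r
# Hypothetical toy inputs: 3 genes, 4 samples, 2 conditions
genes <- c("g1", "g2", "g3")
x     <- matrix(rnorm(12), nrow = 3, ncol = 4,
                dimnames = list(genes, paste0("s", 1:4)))
group <- c(1, 1, 2, 2)

# One adjacency matrix per condition; the second is deliberately misordered
A <- list(matrix(0, 3, 3, dimnames = list(genes, genes)),
          matrix(0, 3, 3, dimnames = list(rev(genes), rev(genes))))

# group must have one entry per column (sample) of x
stopifnot(length(group) == ncol(x))

# Check that each adjacency matrix has the same rownames as x
same_names <- vapply(A, function(a) identical(rownames(a), rownames(x)),
                     logical(1))

# Reorder rows and columns of any mismatched matrix to follow x
A_fixed <- lapply(A, function(a) a[rownames(x), rownames(x)])
fixed_ok <- vapply(A_fixed, function(a) identical(rownames(a), rownames(x)),
                   logical(1))
```

Reordering by name, rather than by position, keeps each gene attached to its own row and column of the adjacency matrix.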

Details

The function NetGSA carries out a Network-based Gene Set Analysis, using the method described in Shojaie and Michailidis (2009) and Shojaie and Michailidis (2010). It can be used for gene set (pathway) enrichment analysis where the data come from $K$ heterogeneous conditions. NetGSA differs from Gene Set Analysis (Efron and Tibshirani, 2007) in that it incorporates the underlying biological networks. Therefore, when the networks encoded in A are empty, one should instead consider alternative approaches such as Gene Set Analysis (Efron and Tibshirani, 2007).

The NetGSA method is formulated in terms of a mixed linear model. Let $X$ represent the rearrangement of data x into an $np \times 1$ column vector.

$$X = \Psi \beta + \Pi \gamma + \epsilon$$

where $\beta$ is the vector of fixed effects, $\gamma$ and $\epsilon$ are random effects and random errors, respectively. The underlying biological networks are encoded in the weighted adjacency matrices, which determine the influence matrix under each condition. The influence matrices further determine the design matrices $\Psi$ and $\Pi$ in the mixed linear model. Formally, the influence matrix under each condition represents the effect of each gene on all the other genes in the network and is calculated from the adjacency matrix (A[[k]] for the $k$-th condition). A small value of eta is used to make sure that the influence matrices are well-conditioned (i.e. their condition numbers are bounded by lim4kappa).
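To make the role of the influence matrix and of lim4kappa more concrete, here is a small base R sketch. It assumes a directed acyclic network and the form $\Lambda = (I - A)^{-1}$ from Shojaie and Michailidis (2009); the convention that A[i, j] is the effect of gene j on gene i, the toy weights, and the way the bound is checked are all assumptions for illustration. It does not show how NetGSA adjusts eta internally.

```r
# Hypothetical 3-gene chain g1 -> g2 -> g3 with made-up edge weights
genes <- c("g1", "g2", "g3")
A <- matrix(0, 3, 3, dimnames = list(genes, genes))
A["g2", "g1"] <- 0.5   # assumed convention: A[i, j] is the effect of j on i
A["g3", "g2"] <- 0.8

# Assumed influence matrix for a directed acyclic network
Lambda <- solve(diag(3) - A)

# Lambda[3, 1] accumulates the indirect effect of g1 on g3 (0.8 * 0.5)
indirect <- unname(Lambda[3, 1])

# Compare the 2-norm condition number with the default lim4kappa = 500
lim4kappa <- 500
cond_num  <- kappa(Lambda, exact = TRUE)
well_conditioned <- cond_num <= lim4kappa
```

In this toy network the influence matrix propagates the effect of g1 through g2 to g3, which is the sense in which it captures the effect of each gene on all the other genes.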

The problem is then to test the null hypothesis $\ell\beta = 0$ against the alternative $\ell\beta \neq 0$, where $\ell$ is a contrast vector, optimally defined through the underlying networks. For a one-sample or two-sample test, the test statistic $T$ for each gene set has approximately a t-distribution under the null, whose degrees of freedom are estimated using the Satterthwaite approximation method. When analyzing complex experiments involving multiple conditions, often multiple contrast vectors of interest are considered for a specific subnetwork. Alternatively, one can combine the contrast vectors into a contrast matrix $L$. A different test statistic $F$ will be used. Under the null, $F$ has an F-distribution, whose degrees of freedom are calculated based on the contrast matrix $L$ as well as variances of $\gamma$ and $\epsilon$. The fixed effects $\beta$ are estimated by generalized least squares, and the estimate depends on estimated variance components of $\gamma$ and $\epsilon$.

Estimation of the variance components ($\sigma^2_{\epsilon}$ and $\sigma^2_{\gamma}$) can be done in several different ways after profiling out $\sigma^2_{\epsilon}$, including REML/ML, which use Newton's method, or HE/REHE, which are based on the Haseman-Elston regression method. The latter uses the fact that $\mathrm{Var}(X) = \sigma^2_{\gamma} \Pi \Pi' + \sigma^2_{\epsilon} I$, and applies ordinary least squares to solve for the unknown coefficients after vectorizing both sides. In particular, REHE uses nonnegative least squares for the regression and therefore ensures nonnegative estimates of the variance components. Due to the simple formulation, HE/REHE also allows subsampling with respect to both the samples and the variables, and is recommended especially when the problem is large (i.e. large $p$ and/or large $n$).
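The Haseman-Elston idea described above can be sketched in base R. This is an illustration only, not the netgsa implementation: it assumes the fixed effects are zero, uses simulated data, and the helper he_fit, its nonneg argument and the subsampling step are hypothetical names introduced here.

```r
# Sketch of HE regression, assuming Var(X) = s2_gamma * Pi Pi' + s2_eps * I
# and zero fixed effects. he_fit is a hypothetical helper, not a netgsa function.
he_fit <- function(X, Pi, nonneg = FALSE) {
  y <- as.vector(tcrossprod(X))                      # vec(X X')
  Z <- cbind(gamma   = as.vector(tcrossprod(Pi)),    # vec(Pi Pi')
             epsilon = as.vector(diag(length(X))))   # vec(I)
  est <- drop(solve(crossprod(Z), crossprod(Z, y)))  # ordinary least squares
  if (nonneg && any(est < 0)) {
    # With two coefficients, the nonnegative solution lies on a boundary:
    # fit each coefficient alone (clamped at 0) and keep the better fit.
    cand <- lapply(1:2, function(j) {
      b <- c(0, 0)
      b[j] <- max(0, sum(Z[, j] * y) / sum(Z[, j]^2))
      b
    })
    rss <- vapply(cand, function(b) sum((y - Z %*% b)^2), numeric(1))
    est <- setNames(cand[[which.min(rss)]], colnames(Z))
  }
  est
}

set.seed(1)
m  <- 150                                  # length of the stacked data vector
q  <- 40                                   # number of random effects
Pi <- matrix(rnorm(m * q), m, q)
X  <- Pi %*% rnorm(q, sd = sqrt(2)) + rnorm(m, sd = 1)

he   <- he_fit(X, Pi)                      # HE: estimates may be negative
rehe <- he_fit(X, Pi, nonneg = TRUE)       # REHE-style: estimates are >= 0

# Hypothetical subsampling: refit on a random half of the entries of X
idx      <- sample(m, size = floor(0.5 * m))
rehe_sub <- he_fit(X[idx, , drop = FALSE], Pi[idx, , drop = FALSE],
                   nonneg = TRUE)
```

Because the regression only involves the entries of $X X'$ and $\Pi \Pi'$, restricting both to a subset of rows leaves the same two-coefficient problem on a smaller scale, which is why subsampling is cheap for HE/REHE.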

The pathway membership information is stored in pathways, which should be a matrix of dimension $npath \times p$. See prepareAdjMat for details on how to prepare a suitable pathway membership object.
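As a rough illustration of the expected layout, the sketch below builds an npath by $p$ indicator matrix from a named list of gene sets in base R. The gene names, gene sets and the manual minsize filter are hypothetical; in practice the package's own preparation functions should be preferred.

```r
# Hypothetical genes (rows of x) and gene sets
genes     <- paste0("g", 1:6)
gene_sets <- list(pathA = c("g1", "g2", "g3"),
                  pathB = c("g3", "g4", "g6", "g9"),  # g9 is not in the data
                  pathC = c("g5"))

# One row per pathway, one column per gene; 1 marks membership
pathways_mat <- t(sapply(gene_sets, function(s) as.integer(genes %in% s)))
colnames(pathways_mat) <- genes

# Mimic the minsize argument: drop pathways with too few measured genes
minsize <- 3
kept    <- pathways_mat[rowSums(pathways_mat) >= minsize, , drop = FALSE]
```

Genes that appear in a pathway but not in the data (g9 here) do not count towards the pathway size, so a pathway can fall below minsize after matching.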

This function can deal with both directed and undirected networks, which are specified via the option directed. Note NetGSA uses slightly different procedures to calculate the influence matrices for directed and undirected networks. In either case, the user can still apply NetGSA if only partial information on the adjacency matrices is available. The functions netEst.undir and netEst.dir provide details on how to estimate the weighted adjacency matrices from data based on available network information.

Value

A list with components

results

A data frame with pathway names, pathway sizes, p-values, false discovery rate corrected q-values, and test statistics for all pathways.

beta

Vector of fixed effects of length $kp$, where $k$ is the number of conditions; the first $p$ elements correspond to condition 1, the second $p$ to condition 2, etc.

s2.epsilon

Variance of the random errors $\epsilon$.

s2.gamma

Variance of the random effects $\gamma$.

graph

List of components needed in plot.NetGSA.
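The relation between the p-values and the false discovery rate corrected q-values in the results component can be illustrated with base R. This sketch assumes a Benjamini-Hochberg correction; the column names and p-values are made up and are not taken from actual NetGSA output.

```r
# Hypothetical results table with made-up p-values
results <- data.frame(pathway = c("pathA", "pathB", "pathC", "pathD"),
                      pval    = c(0.01, 0.04, 0.03, 0.20))

# Assumed FDR correction: Benjamini-Hochberg adjusted p-values
results$pFdr <- p.adjust(results$pval, method = "BH")

# Rank pathways by q-value and flag those below a 0.1 threshold
results     <- results[order(results$pFdr), ]
significant <- results$pathway[results$pFdr < 0.1]
```

Sorting by the q-value rather than the raw p-value gives the list of pathways that remain significant after accounting for the number of pathways tested.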

Author(s)

Ali Shojaie and Jing Ma

References

Ma, J., Shojaie, A. & Michailidis, G. (2016) Network-based pathway enrichment analysis with incomplete network information. Bioinformatics 32(20):3165–3174. doi:10.1093/bioinformatics/btw410

Shojaie, A., & Michailidis, G. (2010). Network enrichment analysis in complex experiments. Statistical Applications in Genetics and Molecular Biology, 9(1), Article 22. https://pubmed.ncbi.nlm.nih.gov/20597848/

Shojaie, A., & Michailidis, G. (2009). Analysis of gene sets based on the underlying regulatory network. Journal of Computational Biology, 16(3), 407-426. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3131840/

See Also

prepareAdjMat, netEst.dir, netEst.undir

Examples


## load the data
data("breastcancer2012_subset")

## consider genes from just 2 pathways
genenames    <- unique(c(pathways[["Adipocytokine signaling pathway"]], 
                         pathways[["Adrenergic signaling in cardiomyocytes"]]))
sx           <- x[match(rownames(x), genenames, nomatch = 0L) > 0L,]

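## search the selected databases for edges among the genes in sx
db_edges       <- obtainEdgeList(rownames(sx), databases = c("kegg", "reactome"))
## estimate the weighted adjacency matrices with clustering enabled
adj_cluster    <- prepareAdjMat(sx, group, databases = db_edges, cluster = TRUE)
## run NetGSA on the first two rows of pathways_mat, using REHE without subsampling
out_cluster    <- NetGSA(adj_cluster[["Adj"]], sx, group, 
                         pathways_mat[c(1,2), rownames(sx)], lklMethod = "REHE", sampling = FALSE)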


[Package netgsa version 4.0.5 Index]