R: Multi-Omic integration via Sparse Singular value...

moss {MOSS}

R Documentation

Multi-Omic integration via Sparse Singular value decomposition.

Description

This function integrates omic blocks to perform sparse singular value decomposition (SVD), non-linear embedding, and/or cluster analysis. Both supervised and unsupervised methods are supported. In both cases, if multiple omic blocks are used as predictors, they are concatenated and normalized to form an 'extended' omic matrix 'X' (Gonzalez-Reymundez and Vazquez, 2020). Supervised analysis can be obtained by indicating which omic block defines a multivariate response 'Y'. Each method within MOSS returns a matrix 'B', which form depends on the technique used (e.g. B = X in pca; B = X'Y, for pls; B = (X'X)^-X'Y, for lrr). A sparse SVD of matrix B is then obtained to summarize the variability among samples and features in terms of latent factors.

Usage

moss(
  data.blocks,
  scale.arg = TRUE,
  norm.arg = TRUE,
  method = "pca",
  resp.block = NULL,
  covs = NULL,
  K.X = 5,
  K.Y = K.X,
  verbose = TRUE,
  nu.parallel = FALSE,
  nu.u = NULL,
  nu.v = NULL,
  alpha.u = 1,
  alpha.v = 1,
  plot = FALSE,
  cluster = FALSE,
  clus.lab = NULL,
  tSNE = FALSE,
  axes.pos = seq_len(K.Y),
  approx.arg = FALSE,
  exact.dg = FALSE,
  use.fbm = FALSE,
  lib.thresh = TRUE
)

Arguments

`data.blocks`	List containing omic blocks of class 'matrix' or 'FBM'. In each block, rows represent subjects and columns features.
`scale.arg`	Should the omic blocks be centered and scaled? Logical. Defaults to TRUE.
`norm.arg`	Should omic blocks be normalized? Logical. Defaults to TRUE.
`method`	Multivariate method. Character. Defaults to 'pca'. Possible options are pca, mbpca, pca-lda, mbpca-lda, pls, mbpls, pls-lda, mbpls-lda, lrr, mblrr, lrr-lda, mblrr-lda.
`resp.block`	What block should be used as response? Integer. Only used when the specified method is supervised.
`covs`	Covariates which effect we wish to adjust for. Accepts matrix, data.frame, numeric, or character vectors.
`K.X`	Number of principal components for predictors. Integer. Defaults to 5.
`K.Y`	Number of responses PC index when method is supervised. Defaults to K.X.
`verbose`	Should we print messages? Logical. Defaults to TRUE.
`nu.parallel`	Tuning degrees of sparsity in parallel. Defaults to FALSE.
`nu.u`	A grid with increasing integers representing degrees of sparsity for left Eigenvectors. Defaults to NULL.
`nu.v`	Same but for right Eigenvectors. Defaults to NULL.
`alpha.u`	Elastic Net parameter for left Eigenvectors. Numeric between 0 and 1. Defaults to 1.
`alpha.v`	Elastic Net parameter for right Eigenvectors. Numeric between 0 and 1. Defaults to 1.
`plot`	Should results be plotted? Logical. Defaults to FALSE.
`cluster`	Arguments passed to the function tsne2clus as a list. Defaults to FALSE. If cluster=TRUE, default parameters are used (eps_range=c(0,4), eps_res=100).
`clus.lab`	A vector of same length than number of subjects with labels used to visualize clusters. Factor. Defaults to NULL. When sparsity is imposed on the left Eigenvectors, the association between non-zero loadings and labels' groups is shown by a Chi-2 statistics for each pc. When sparsity is not imposed, the association between labels and PC is addressed by a Kruskal-Wallis statistics.
`tSNE`	Arguments passed to the function pca2tsne as a list. Defaults to FALSE. If tSNE=T, default parameters are used (perp=50,n.samples=1,n.iter=1e3).
`axes.pos`	PC index used for tSNE. Defaults to 1 : K.Y. Used only when tSNE is different than NULL.
`approx.arg`	Should we use standard SVD or random approximations? Defaults to FALSE. If TRUE and at least one block is of class 'matrix', irlba is called. If TRUE & is(O,'FBM')==TRUE, big_randomSVD is called.
`exact.dg`	Should we compute exact degrees of sparsity? Logical. Defaults to FALSE. Only relevant When alpha.s or alpha.f are in the (0,1) interval and exact.dg = TRUE.
`use.fbm`	Should we treat omic blocks as Filed Backed Matrix (FBM)? Logical. Defaults to FALSE.
`lib.thresh`	Should we use a liberal or conservative threshold to tune degrees of sparsity? Logical. Defaults to TRUE.

Details

Once 'dense' solutions are found (the result of SVD on a matrix B), the function ssvdEN_sol_path is called to perform sparse SVD (sSVD) on a grid of possible degrees of sparsity (nu), for a possible value of the elastic net parameter (alpha). The sSVD is performed using the algorithm of Shen and Huang (2008), extended to include Elastic Net type of regularization. For one latent factor (rank 1 case), the algorithm finds vectors u and v' and scalar d that minimize:

||B-d* uv'||^2 + lambda(nu_v)(alpha_v||v'||_1 + (1-alpha_v)||v'||^2) + lambda(nu_u)(alpha_u||u||_1 + (1-alpha_u)||u||^2)

such that ||u|| = 1. The right Eigenvector is obtained from v / ||v|| and the corresponding d from u'Bv. The element lambda(nu_.) is a monotonically decreasing function of nu_. (the number of desired element different from zero) onto positive real numbers, and alpha_. is any number between zero and one balancing shrinking and variable selection. Selecting degree of sparsity: The function allows to tune the degree of sparsity using an ad-hoc method based on the one presented in Shen & Huang (2008, see reference) and generalized for tuning both nu_v and nu_u. This is done by exploring the proportion of explained variance (PEV) on a grid of possible values. Drastic and/or steep changes in the PEV trajectory across degrees of sparsity are used for automatic selection (see help for the function ssvdEN_sol_path). By imposing the additional assumption of omic blocks being conditionally independent, each multivariate technique can be extended using a 'multi-block' approach, where the contribution of each omic block to the total (co)variance is addressed. When response Y is a character column matrix, with classes or categories by subject, each multivariate technique can be extended to perform linear discriminant analysis.

Value

Returns a list with the results of the sparse SVD. If plot=TRUE, a series of plots is generated as well.

B: The object of the (sparse) SVD. Depending of the method used, B can be a extended matrix of normalized omic blocks, a variance-covariance matrix, or a matrix of regression coefficients. If at least one of the blocks in 'data.blocks' is of class FBM, is(B,'FBM') is TRUE. Otherwise, is(B,'matrix') is TRUE.
Q: Matrix with the SVD projections at the level of subjects.
selected_items: List containing the position, name, and loadings of selected features and subjects by latent dimension. if 'plot=TRUE', a scatterplot is displayed, where the x-axis represents the latent dimensions, the y-axis the total number of features selected in log scale, and each point is a pie chart showing the relative contribution of each omic to the number of features selected. The radio of the pie-chart represents the coefficient of variation among squared loadings (mean squared loadings divided by their standard deviation)
dense: A list containing the results of the dense SVD.
- u: Matrix with left Eigenvectors.
- v: Matrix with right Eigenvectors.
- d: Matrix with singular values.
sparse: A list containing the results of the sparse SVD.
- u: Matrix with left Eigenvectors.
- v: Matrix with right Eigenvectors.
- d: Matrix with singular values.
- opt_dg_v Selected degrees of sparsity for right Eigenvectors.
- opt_dg_u: Selected degrees of sparsity for left Eigenvectors.
Graphical displays: Depending on the values in 'plot', 'tSNE','cluster', and 'clus.lab' arguments, the following ggplot objects can be obtained. They contain:
- scree_plot: Plots of Eigenvalues and their first and second order empirical derivatives along PC indexes.
- tun_dgSpar_plot: Plots with the PEV trajectory, as well as its first and second empirical derivatives along the degrees of sparsity path.
- PC_plot: Plot of the first principal components according to axes.pos. By default the first two are plotted.
- tSNE_plot: Plot with the tSNE mapping onto two dimensions.
- clus_plot: The output of function tsne2clus.
- subLabels_vs_PCs: Plot of the Kruskal-Wallis (or Chi-square) statistics of the association test between PC (or selected subjects) and pre-established subjects groups.
- clusters_vs_PCs: Plot of the Kruskal-Wallis (or Chi-square) statistics of the association test between PC (or selected subjects) and detected clusters.

Note

The function does not return PEV for EN parameter (alpha_v and/or alpha_u), the user needs to provide a single value for each.
When number of PC index > 1, columns of T might not be orthogonal.
Although the user is encouraged to perform data projection and cluster separately, MOSS allows to do this automatically. However, both tasks might require further tuning than the provided by default, and computations could become cumbersome.
Tuning of degrees of sparsity is done heuristically on training set. In our experience, this results in high specificity, but rather low sensitivity. (i.e. too liberal cutoffs, as compared with extensive cross-validation on testing set).
When 'method' is an unsupervised technique, 'K.X' is the number of latent factors returned and used in further analysis. When 'method' is a supervised technique, 'K.X' is used to perform a SVD to facilitate the product of large matrices and inverses.
If 'K.X' (or 'K.Y') equal 1, no plots are returned.
Although the degree of sparsity maps onto number of features/subjects for Lasso, the user needs to be aware that this conceptual correspondence is lost for full EN (alpha belonging to (0, 1); e.g. the number of features selected with alpha < 1 will be eventually larger than the optimal degree of sparsity). This allows to rapidly increase the number of non-zero elements when tuning the degrees of sparsity. In order to get exact values for the degrees of sparsity at subjects or features levels, the user needs to set the value of 'exact.dg' parameter from 'FALSE' (the default) to 'TRUE'.

References

Gonzalez-Reymundez, and Vazquez. 2020. Multi-omic Signatures identify pan-cancer classes of tumors beyond tissue of origin. Scientific Reports 10 (1):8341
Shen, Haipeng, and Jianhua Z. Huang. 2008. Sparse Principal Component Analysis via Regularized Low Rank Matrix approximation. Journal of Multivariate Analysis 99 (6). Academic Press: 1015_34.
Baglama, Jim, Lothar Reichel, and B W Lewis. 2018. Irlba: Fast Truncated Singular Value Decomposition and Principal Components Analysis for Large Dense and Sparse Matrices.
Taskesen, Erdogan, Sjoerd M. H. Huisman, Ahmed Mahfouz, Jesse H. Krijthe, Jeroen de Ridder, Anja van de Stolpe, Erik van den Akker, Wim Verheagh, and Marcel J. T. Reinders. 2016. Pan-Cancer Subtyping in a 2D-Map Shows Substructures That Are Driven by Specific Combinations of Molecular Characteristics. Scientific Reports 6 (1):24949.
van der Maaten L, Hinton G. Visualizing Data using t-SNE. J Mach Learn Res. 2008;9: 2579–2605

Examples

# Example1: sparse PCA of a list of omic blocks.
library("MOSS")
sim_data <- simulate_data()
set.seed(43)

# Extracting simulated omic blocks.
sim_blocks <- sim_data$sim_blocks

# Extracting subjects and features labels.
lab.sub <- sim_data$labels$lab.sub
lab.feat <- sim_data$labels$lab.feat
out <- moss(sim_blocks[-4],
  method = "pca",
  nu.v = seq(1, 200, by = 100),
  nu.u = seq(1, 100, by = 50),
  alpha.v = 0.5,
  alpha.u = 1
)

library(ggplot2)
library(ggthemes)
library(viridis)
library(cluster)
library(fpc)

set.seed(43)

# Example2: sparse PCA with t-SNE, clustering, and association with
# predefined groups of subjects.
out <- moss(sim_blocks[-4],axes.pos=c(1:5),
  method = "pca",
  nu.v = seq(1, 200, by = 10),
  nu.u = seq(1, 100, by = 2),
  alpha.v = 0.5,
  alpha.u = 1,
  tSNE = TRUE,
  cluster = TRUE,
  clus.lab = lab.sub,
  plot = TRUE
)

# This shows clusters obtained with labels from pre-defined groups
# of subjects.
out$clus_plot

# This shows the statistical overlap between PCs and the pre-defined
# groups of subjects.
out$subLabels_vs_PCs

# This shows the contribution of each omic to the features 
# selected by PC index.
out$selected_items

# This shows features forming signatures across clusters.
out$feat_signatures

# Example3: Multi-block PCA with sparsity.
out <- moss(sim_blocks[-4],axes.pos=1:5,
  method = "mbpca",
  nu.v = seq(1, 200, by = 10),
  nu.u = seq(1, 100, by = 2),
  alpha.v = 0.5,
  alpha.u = 1,
  tSNE = TRUE,
  cluster = TRUE,
  clus.lab = lab.sub,
  plot = TRUE
)
out$clus_plot

# This shows the 'weight' each omic block has on the variability
# explained by each PC. Weights in each PC add up to one.
out$block_weights

# Example4: Partial least squares with sparsity (PLS).
out <- moss(sim_blocks[-4],axes.pos=1:5,
  K.X = 500,
  K.Y = 5,
  method = "pls",
  nu.v = seq(1, 100, by = 2),
  nu.u = seq(1, 100, by = 2),
  alpha.v = 1,
  alpha.u = 1,
  tSNE = TRUE,
  cluster = TRUE,
  clus.lab = lab.sub,
  resp.block = 3,
  plot = TRUE
)
out$clus_plot

# Get some measurement of accuracy at detecting features with signal
# versus background noise.
table(out$sparse$u[, 1] != 0, lab.feat[1:2000])
table(out$sparse$v[, 1] != 0, lab.feat[2001:3000])

# Example5: PCA-LDA
out <- moss(sim_blocks,
  method = "pca-lda",
  cluster = TRUE,
  resp.block = 4,
  clus.lab = lab.sub,
  plot = TRUE
)
out$clus_plot

[Package MOSS version 0.2.2 Index]