OmicsPLS {OmicsPLS} | R Documentation |
Data integration with O2PLS: Two-Way Orthogonal Partial Least Squares
Description
The OmicsPLS package is an R package for penalized integration of heterogeneous omics data. The software articles are published in (el Bouhaddani et al, 2018, doi: 10.1186/s12859-018-2371-3) and (Gu et al, 2021, doi: 10.1186/s12859-021-03958-3). OmicsPLS includes the O2PLS fit, the GO2PLS fit, cross-validation tools and some miscellaneous helper functions.
Model and assumptions
Note that the rows of X and Y are the subjects and the columns are variables.
The number of columns may be different, but the subjects should be the same in both datasets.
The O2PLS model (Trygg & Wold, 2003) decomposes two datasets X and Y into three parts.
1. A joint part, representing the relationship between X and Y.
2. An orthogonal part, representing the unrelated latent variation in X and Y separately.
3. A noise part capturing all residual variation.
See also the corresponding paper (el Bouhaddani et al, 2018).
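The three-part decomposition can be illustrated by simulating data with that structure. The following is a minimal base-R sketch, not the package's own simulation code. All object names (rand_orth, T_o, U_o, P_o, Q_o, E, Ff) and the dimensions are assumptions chosen for illustration, and the orthogonal parts are simply drawn independently of the joint parts.

```r
set.seed(1)
N <- 50; p <- 10; q <- 8      # subjects, variables in X, variables in Y
n <- 2; nx <- 1; ny <- 1      # joint and orthogonal component counts

# Hypothetical helper: random matrix with orthonormal columns (via QR)
rand_orth <- function(rows, cols) {
  qr.Q(qr(matrix(rnorm(rows * cols), rows, cols)))
}

# 1. Joint part: scores of Y are related to the scores of X
Tt <- matrix(rnorm(N * n), N, n)
U  <- Tt + matrix(rnorm(N * n, sd = 0.1), N, n)
W  <- rand_orth(p, n)         # joint loadings of X
C  <- rand_orth(q, n)         # joint loadings of Y

# 2. Orthogonal part: latent variation specific to X or to Y
T_o <- matrix(rnorm(N * nx), N, nx)
U_o <- matrix(rnorm(N * ny), N, ny)
P_o <- rand_orth(p, nx)
Q_o <- rand_orth(q, ny)

# 3. Noise part: residual variation
E  <- matrix(rnorm(N * p, sd = 0.1), N, p)
Ff <- matrix(rnorm(N * q, sd = 0.1), N, q)

# Rows are subjects, columns are variables
X <- Tt %*% t(W) + T_o %*% t(P_o) + E
Y <- U  %*% t(C) + U_o %*% t(Q_o) + Ff
```

Both matrices share the same N subjects in their rows, while the number of columns differs, which matches the data layout described above.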
Fitting
The O2PLS fit is done with o2m.
For data X and Y you can run o2m(X,Y,n,nx,ny) for an O2PLS fit with n joint and nx, ny orthogonal components.
See the help page of o2m for more information on parameters.
There are four ways to obtain an O2PLS fit, depending on the dimensionality, and a fifth option for GO2PLS.
- For the not-too-high dimensional case, you may use o2m with default parameters. E.g. o2m(X,Y,n,nx,ny).
- In case you only want the parameters, you may add stripped = TRUE to obtain a stripped version of o2m which avoids calculating and storing some matrices. E.g. o2m(X,Y,n,nx,ny,stripped=TRUE).
- For high dimensional cases, defined by ncol(X)>p_thresh and ncol(Y)>q_thresh, a NIPALS approach is used which avoids storing large matrices. E.g. o2m(X,Y,n,nx,ny,p_thresh=3000,q_thresh=3000). The thresholds are by default both at 3000 variables.
- If you want a stripped version in the high dimensional case, add stripped = TRUE. E.g. o2m(X,Y,n,nx,ny,stripped=TRUE,p_thresh=3000,q_thresh=3000).
- For GO2PLS, add sparse = TRUE and specify how many variables or groups to retain. E.g. o2m(X,Y,n,nx,ny,sparse=TRUE,keepx=keepx,keepy=keepy).
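As a rough illustration of the first two options, the sketch below fits a default and a stripped O2PLS model on small simulated data. The data-generating code, the object names (latent, fit, fit_stripped) and the component counts are assumptions made for this example. The data are centered beforehand as a precaution.

```r
library(OmicsPLS)

set.seed(2)
N <- 40
# Two shared latent variables drive both datasets (illustrative setup)
latent <- matrix(rnorm(N * 2), N, 2)
X <- latent %*% matrix(rnorm(2 * 12), 2, 12) + matrix(rnorm(N * 12), N, 12)
Y <- latent %*% matrix(rnorm(2 * 9), 2, 9)  + matrix(rnorm(N * 9), N, 9)

# Center the columns of both datasets
X <- scale(X, scale = FALSE)
Y <- scale(Y, scale = FALSE)

# Default fit: 2 joint components, 1 orthogonal component in X and in Y
fit <- o2m(X, Y, n = 2, nx = 1, ny = 1)

# Stripped fit: same model, but fewer matrices are calculated and stored
fit_stripped <- o2m(X, Y, n = 2, nx = 1, ny = 1, stripped = TRUE)
```

With only 12 and 9 columns, both calls stay well below the default thresholds of 3000 variables, so the NIPALS route is not triggered here.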
Obtaining results
After fitting an O2PLS model, by running e.g. fit = o2m(X,Y,n,nx,ny), the results can be visualised.
Use plot(fit,...) to plot the desired loadings with/without ggplot2.
Use summary(fit,...) to see the relative explained variances in the joint/orthogonal parts.
Plotting the joint scores fit$Tt, fit$U and orthogonal scores fit$T_Yosc, fit$U_Xosc is also helpful.
Cross-validating
Determining the number of components n,nx,ny is an important task. For this we have two methods.
See citation("OmicsPLS") for our proposed approach for determining the number of components, implemented in crossval_o2m_adjR2.
- Cross-validation (CV) is done with crossval_o2m and crossval_o2m_adjR2; both have built-in parallelization, which relies on the parallel package. Usage is something like crossval_o2m(X,Y,a,ax,ay,nr_folds) where a,ax,ay are vectors of integers. See the help pages. nr_folds is the number of folds, with nr_folds = nrow(X) for Leave-One-Out CV.
- For crossval_o2m_adjR2 the same parameters are to be specified. This way of cross-validating is (potentially much) faster than the standard approach. It is also recommended over the standard CV.
- To cross-validate the number of variables to keep, use crossval_sparsity.
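A small sketch of how the two cross-validation functions might be called is given below. The simulated data, the candidate grids 1:2 and 0:1, and the choice of 5 folds are assumptions picked to keep the run short. The argument nr_cores is taken from the help pages of these functions and is set to 1 here.

```r
library(OmicsPLS)

set.seed(3)
N <- 40
latent <- matrix(rnorm(N * 2), N, 2)
X <- scale(latent %*% matrix(rnorm(2 * 12), 2, 12) +
             matrix(rnorm(N * 12), N, 12), scale = FALSE)
Y <- scale(latent %*% matrix(rnorm(2 * 9), 2, 9) +
             matrix(rnorm(N * 9), N, 9), scale = FALSE)

# Candidate numbers of joint (a) and orthogonal (ax, ay) components
a  <- 1:2
ax <- 0:1
ay <- 0:1

# Alternative (faster) cross-validation
cv_adj <- crossval_o2m_adjR2(X, Y, a, ax, ay, nr_folds = 5, nr_cores = 1)

# Standard cross-validation over the full grid;
# nr_folds = nrow(X) would give Leave-One-Out CV
cv_std <- crossval_o2m(X, Y, a, ax, ay, nr_folds = 5, nr_cores = 1)

print(cv_adj)
```

The printed results can then be inspected to choose n, nx and ny before the final call to o2m.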
S3 methods
There are S3 methods implemented for a fit obtained with o2m, i.e. fit <- o2m(X,Y,n,nx,ny).
- Use plot(fit) to plot the loadings, see above.
- Use loadings(fit) to extract a matrix with loading values.
- Use scores(fit) to extract the scores.
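The extraction methods can be tried on a small fit as sketched below. The simulated data and object names are assumptions for illustration. The part names "Xjoint" and "Yjoint" follow the help pages of loadings and scores, which should be consulted for the full set of options.

```r
library(OmicsPLS)

set.seed(5)
N <- 40
latent <- matrix(rnorm(N * 2), N, 2)
X <- scale(latent %*% matrix(rnorm(2 * 12), 2, 12) +
             matrix(rnorm(N * 12), N, 12), scale = FALSE)
Y <- scale(latent %*% matrix(rnorm(2 * 9), 2, 9) +
             matrix(rnorm(N * 9), N, 9), scale = FALSE)

fit <- o2m(X, Y, n = 2, nx = 1, ny = 1)

# Relative explained variances of the joint and orthogonal parts
summary(fit)

# Joint loadings of X: one row per variable, one column per component
load_x <- loadings(fit, "Xjoint")

# Joint scores of Y: one row per subject, one column per component
score_y <- scores(fit, "Yjoint")

# Joint scores of X against joint scores of Y, first component (base graphics)
plot(fit$Tt[, 1], fit$U[, 1], xlab = "Tt[, 1]", ylab = "U[, 1]")
```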
Imputation
When the data contains missing values, one should impute them prior to using O2PLS.
There are many sophisticated approaches available, such as MICE and MissForest, and no one approach is the best for all situations.
To still allow users to quickly impute missing values in their data matrix, the impute_matrix function is implemented.
It relies on the softImpute function from the softImpute package and imputes based on the singular value decomposition.
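A quick imputation might look like the sketch below. It assumes the softImpute package is installed and that impute_matrix can be called with the data matrix as its only argument. The matrix size and the number of missing entries are arbitrary choices for illustration.

```r
library(OmicsPLS)

set.seed(4)
X <- matrix(rnorm(30 * 6), 30, 6)

# Set 10 randomly chosen entries to missing
X[sample(length(X), 10)] <- NA

# Impute the missing entries (SVD-based, via softImpute)
X_imp <- impute_matrix(X)

# Number of missing values before and after imputation
c(before = sum(is.na(X)), after = sum(is.na(X_imp)))
```

For a real analysis, a method suited to the data at hand (such as those mentioned above) may be preferable to this quick approach.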
Misc
Also some handy tools are available:
- orth(X) is a function to obtain an orthogonalized version of a matrix or vector X.
- ssq(X) is a function to calculate the sum of squares (or squared Frobenius norm) of X. See also vnorm for calculating the norm of each column in X.
- mse(x, y) returns the mean squared difference between two matrices/vectors.
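To make the quantities behind these helpers concrete, the sketch below computes them in base R. These are not the package implementations. The variable names (ssq_X, col_norms, mse_xy, Q) are hypothetical, and the use of a QR decomposition for orthogonalization is an assumption about one reasonable way to do it.

```r
# Small example matrix with columns (3, 4, 0) and (0, 0, 5)
X <- matrix(c(3, 4, 0, 0, 0, 5), nrow = 3)
Z <- matrix(0, nrow = 3, ncol = 2)

# Sum of squares, i.e. the squared Frobenius norm: 9 + 16 + 25
ssq_X <- sum(X^2)

# Euclidean norm of each column
col_norms <- sqrt(colSums(X^2))

# Mean squared difference between X and Z
mse_xy <- mean((X - Z)^2)

# Orthogonalized version of X: orthonormal columns spanning the same space
Q <- qr.Q(qr(X))

ssq_X       # 50
col_norms   # 5 5
```

Here crossprod(Q) equals the 2 x 2 identity matrix up to numerical precision, which is what an orthogonalized matrix should satisfy.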
Citation
If you use the OmicsPLS R package in your research, please cite the corresponding software paper:
el Bouhaddani, S., Uh, H.-W., Jongbloed, G., Hayward, C., Klarić, L., Kiełbasa, S. M., & Houwing-Duistermaat, J. (2018). Integrating omics datasets with the OmicsPLS package. BMC Bioinformatics, 19(1). doi: 10.1186/s12859-018-2371-3
The bibtex entry can be obtained with command citation("OmicsPLS").
Thank you!
The original paper proposing O2PLS is
Trygg, J., & Wold, S. (2003). O2-PLS, a two-block (X-Y) latent variable regression (LVR) method with an integral OSC filter. Journal of Chemometrics, 17(1), 53-64. doi: 10.1002/cem.775
Author(s)
Said el Bouhaddani (s.elbouhaddani@umcutrecht.nl, Twitter: @selbouhaddani), Zhujie Gu, Szymon Kielbasa, Geurt Jongbloed, Jeanine Houwing-Duistermaat, Hae-Won Uh.
Maintainer: Said el Bouhaddani (s.elbouhaddani@umcutrecht.nl).