FusionLearn-package {FusionLearn}    R Documentation

Fusion Learning

Description

The FusionLearn package implements a learning algorithm that integrates information from different experimental platforms. The algorithm applies the grouped penalization method in the pseudolikelihood setting.

Details

In the context of fusion learning, there are k different data sets from k different experimental platforms. The data from each platform can be modeled by a different generalized linear model. Assume the same set of predictors {M_1, M_2, ..., M_j, ..., M_p} is measured across the k experimental platforms.

Platforms  Formula               M_1                 M_2                 ...  M_j                 ...  M_p
1          y_1: g_1(\mu_1) \sim  x_{11}\beta_{11} +  x_{12}\beta_{12} +  ...  x_{1j}\beta_{1j} +  ...  x_{1p}\beta_{1p}
2          y_2: g_2(\mu_2) \sim  x_{21}\beta_{21} +  x_{22}\beta_{22} +  ...  x_{2j}\beta_{2j} +  ...  x_{2p}\beta_{2p}
...
k          y_k: g_k(\mu_k) \sim  x_{k1}\beta_{k1} +  x_{k2}\beta_{k2} +  ...  x_{kj}\beta_{kj} +  ...  x_{kp}\beta_{kp}

Here x_{kj} represents the observation of the predictor M_j on the kth platform, and \beta^{(j)} denotes the vector of regression coefficients for the predictor M_j.

Platforms  M_j     \beta^{(j)}
1          x_{1j}  \beta_{1j}
2          x_{2j}  \beta_{2j}
...        ...     ...
k          x_{kj}  \beta_{kj}

Consider the following examples.

Example 1. Suppose k different types of experiments are conducted to study the genetic mechanism of a disease. The predictors in this research are different facets of individual genes, such as mRNA expression, protein expression, RNAseq expression, and so on. The goal is to select the genes that affect the disease, where each gene is assessed in a number of ways through different measurement processes across the k experimental platforms.

Example 2. The predictive models for three different financial indices are simultaneously built from a panel of stock index predictors. In this case, the predictor values across different models are the same, but the regression coefficients are different.

In the conventional approach, the model for each of the k platforms is analyzed separately. The FusionLearn algorithm instead selects significant predictors by learning from the multiple models jointly. The overall objective is to maximize the penalized pseudo loglikelihood:

    Q(\beta) = l_I(\beta) - n \sum_{j=1}^{p} \Omega_{\lambda_n}(||\beta^{(j)}||),

with p being the number of predictors, \Omega_{\lambda_n} being the penalty function, and ||\beta^{(j)}|| = (\sum_{i=1}^{k} \beta_{ij}^2)^{1/2} denoting the L_2-norm of the coefficients of the predictor M_j.
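
To make the group norm concrete, the following minimal sketch computes ||\beta^{(j)}|| for each predictor from a coefficient matrix with one row per platform and one column per predictor. The matrix values and the names beta, group_norm, and selected are made up for illustration, and the code does not call any FusionLearn function.

```r
# Hypothetical k x p coefficient matrix (k = 2 platforms, p = 3 predictors).
beta <- matrix(c(0.5, 0.0, -1.2,
                 0.3, 0.0,  0.9),
               nrow = 2, byrow = TRUE,
               dimnames = list(paste0("platform", 1:2), paste0("M", 1:3)))

# Group L2 norm ||beta^(j)|| = sqrt(sum_i beta_ij^2), one value per predictor.
group_norm <- sqrt(colSums(beta^2))

# The norm is zero exactly when the predictor's coefficient is zero on
# every platform, so a positive norm marks a predictor kept in the model.
selected <- names(group_norm)[group_norm > 0]
```

Because the penalty acts on the norm of the whole column, the coefficients of a predictor are treated as one group across platforms rather than being penalized one platform at a time.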

The user can specify the penalty function \Omega_{\lambda_n} and the penalty values \lambda_n. This package also contains functions to provide the pseudolikelihood Bayesian information criterion:

    pseu-BIC(s) = -2 l_I(\hat{\beta}_I; Y) + d_s^{*} \gamma_n

with l_I(\hat{\beta}_I; Y) denoting the pseudo loglikelihood, d_s^{*} measuring the model complexity, and \gamma_n being the penalty on the model complexity.

The basic function fusionbase deals with continuous responses. The function fusionbinary is applied to binary responses, and the function fusionmixed is applied to a mix of continuous and binary responses.

Note

Here we provide two examples to illustrate the data structures. Assume X_I and X_{II} represent two sets of predictors from 2 experimental platforms.

Example 1. If the observations from X_I and X_{II} are independent, the number of observations can be different. The order of the predictors {M_1, M_2, M_3, M_4} in X_I must match the order of the predictors in X_{II}. If X_{II} does not include the predictor M_3, then the M_3 column in X_{II} needs to be filled with NA.

           M_1   M_2   M_3   M_4               M_1   M_2   M_3   M_4
X_I =      0.1   0.3   0.5    20    X_{II} =   100     8    NA   100
           0.3   0.1   0.5     7                30     1    NA     2
           0.1   0.9     1     0                43    19    NA    -3
          -0.3   1.2     2    40
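
The layout above can be written down in R as follows. This is only a sketch of the data structure: the names x1, x2, and x are illustrative, and holding the two matrices in a list is an assumption of this sketch; the help pages of the fitting functions give the expected input format.

```r
# Predictor matrices copied from the independent-platform example.
x1 <- matrix(c( 0.1, 0.3, 0.5, 20,
                0.3, 0.1, 0.5,  7,
                0.1, 0.9, 1,    0,
               -0.3, 1.2, 2,   40),
             ncol = 4, byrow = TRUE,
             dimnames = list(NULL, c("M1", "M2", "M3", "M4")))
x2 <- matrix(c(100,  8, NA, 100,
                30,  1, NA,   2,
                43, 19, NA,  -3),
             ncol = 4, byrow = TRUE,
             dimnames = list(NULL, c("M1", "M2", "M3", "M4")))

# One matrix per platform (assumed container: a plain list).
x <- list(x1, x2)

# The predictors appear in the same order on both platforms ...
same_order <- identical(colnames(x1), colnames(x2))
# ... while the numbers of observations are allowed to differ.
n_obs <- sapply(x, nrow)
# Predictors not measured on platform II show up as all-NA columns.
missing_in_x2 <- colnames(x2)[colSums(is.na(x2)) == nrow(x2)]
```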

Example 2. If the observations from X_I and X_{II} are correlated, the number of observations must be the same. The ith row in X_I is correlated with the ith row in X_{II}. The predictors of X_I and X_{II} should be matched in order. The predictors which are not measured need to be filled with NA.

           M_1   M_2   M_3   M_4                M_1   M_2   M_3   M_4
X_I =      0.1   0.3   0.5    20    X_{II} =    0.3   0.8    NA   100
           0.3   0.1   0.5    70                0.2     1    NA    20
          -0.1   0.9     1     0               0.43   1.9    NA   -30
          -0.3   1.2     2    40               -0.4    -2    NA    40
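
The two rules for correlated platforms (equal numbers of rows, and the same predictor columns on every platform) can be checked before fitting. The helper check_platforms below is hypothetical and not part of FusionLearn; it is a minimal sketch that assumes the platforms are supplied as a list of matrices.

```r
# Hypothetical helper, not part of FusionLearn: validates a list of
# per-platform predictor matrices.
check_platforms <- function(x, correlated) {
  n_pred <- vapply(x, ncol, integer(1))
  if (length(unique(n_pred)) != 1L) {
    stop("every platform needs the same predictor columns; fill unmeasured ones with NA")
  }
  n_obs <- vapply(x, nrow, integer(1))
  if (correlated && length(unique(n_obs)) != 1L) {
    stop("correlated platforms need the same number of observations")
  }
  invisible(TRUE)
}

# Predictor matrices copied from the correlated-platform example.
x1 <- matrix(c( 0.1, 0.3, 0.5, 20,
                0.3, 0.1, 0.5, 70,
               -0.1, 0.9, 1,    0,
               -0.3, 1.2, 2,   40), ncol = 4, byrow = TRUE)
x2 <- matrix(c( 0.3,  0.8, NA, 100,
                0.2,  1,   NA,  20,
                0.43, 1.9, NA, -30,
               -0.4, -2,   NA,  40), ncol = 4, byrow = TRUE)

# Four rows on each platform, so the correlated check passes.
ok_correlated <- check_platforms(list(x1, x2), correlated = TRUE)

# Dropping a row from x2 breaks the row-by-row pairing; capture the message.
err <- tryCatch(check_platforms(list(x1, x2[-1, ]), correlated = TRUE),
                error = function(e) conditionMessage(e))
```

With correlated = FALSE the same unequal row counts are accepted, which mirrors the independent case in Example 1.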

In functions fusionbase.fit, fusionbinary.fit, and fusionmixed.fit, the option depen is used to specify whether observations from different platforms are correlated or independent.

Author(s)

Xin Gao, Yuan Zhong and Raymond J Carroll

Maintainer: Yuan Zhong <aqua.zhong@gmail.com>

References

Gao, X. and Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251-272.


[Package FusionLearn version 0.2.1 Index]