R: Fusion learning algorithm for mixed data

fusionmixed {FusionLearn}

R Documentation

Fusion learning algorithm for mixed data

Description

fusionmixed applies a group penalty, at a specified penalty value, to learn jointly from multiple generalized linear models with mixed continuous and binary responses. fusionmixed.fit searches for the best candidate model over a sequence of penalty values, based on the pseudo Bayesian information criterion.

Usage

fusionmixed(x, y, lambda, N, p, m1, m2, beta=0.1, thresh=0.1, 
            maxiter=100, methods="scad", link="logit", Complete=TRUE)

fusionmixed.fit(x, y, lambda, N, p, m1, m2, beta=0.1, thresh=0.1, 
                maxiter=100, methods="scad",link="logit", Complete=TRUE, 
                depen ="IND", a=1)

Arguments

`x`	List. A list of predictor matrices from the different platforms. The first m1 data sets in the list have continuous responses, and the following m2 data sets have binary responses.
`y`	List. A list of the response vectors from the different platforms, in the same order as in `x`. The values `m1` and `m2` must be specified.
`lambda`	Numeric or vector. For `fusionmixed`, lambda is a single numeric penalty value; for `fusionmixed.fit`, lambda is a vector of penalty values.
`N`	Numeric or vector. If only one numeric value is provided, equal sample size will be assumed for each data set. If a vector is provided, then the elements are the sample sizes for all the platforms.
`p`	Numeric. The number of predictors.
`m1`	Numeric. Number of platforms whose response variables are continuous.
`m2`	Numeric. Number of platforms whose response variables are binary.
`beta`	Numeric. The initial value for the estimated parameters, which have dimensions nvars x nplatforms. The default value is 0.1.
`thresh`	Numeric. The stopping criterion. The default value is 0.1.
`maxiter`	Numeric. Maximum number of iterations. The default value is 100.
`methods`	Character ("lass" or "scad"). `lass`: LASSO; `scad`: SCAD.
`link`	Character ("logit" or "probit"). Link functions: logistic or probit.
`Complete`	Logical. If `Complete == TRUE`, the predictors `M_1`,...,`M_p` are measured in all platforms. If `Complete == FALSE`, some platforms do not measure all of the predictors `{M_1, M_2, ..., M_p}`, and the estimated coefficients corresponding to the missing predictors will be `NA`.
`depen`	Character. Input only for function `fusionmixed.fit`. "IND" means the observations across different platforms are independent; "CORR" means the observations are correlated, and the sample sizes should be equal for different platforms.
`a`	Numeric. Input only for function `fusionmixed.fit`. The free multiplicative constant used in `\gamma_n`. The default value is 1.
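
The layout of `x`, `y`, `N`, `m1`, and `m2` described above can be sketched with simulated data. Everything here is illustrative: the sample sizes, the coefficient vector `b`, and the penalty value 0.3 are assumptions rather than package data. The call to `fusionmixed` is guarded so the sketch still runs when FusionLearn is not installed.

```r
set.seed(1)
p  <- 5               # number of predictors shared by all platforms
m1 <- 2               # platforms with continuous responses
m2 <- 1               # platforms with binary responses
N  <- c(60, 80, 70)   # one sample size per platform

# Coefficients used only to simulate data (hypothetical values)
b <- c(1, -1, 0, 0, 0)

# One predictor matrix per platform; continuous-response platforms come first
x <- lapply(N, function(n) matrix(rnorm(n * p), nrow = n, ncol = p))

y <- vector("list", m1 + m2)
for (k in seq_len(m1)) {
  # Continuous responses: linear predictor plus Gaussian noise
  y[[k]] <- as.vector(x[[k]] %*% b + rnorm(N[k]))
}
for (k in m1 + seq_len(m2)) {
  # Binary responses: Bernoulli draws with logistic success probabilities
  prob <- plogis(as.vector(x[[k]] %*% b))
  y[[k]] <- rbinom(N[k], size = 1, prob = prob)
}

# Consistency checks on the layout the arguments expect
stopifnot(length(x) == m1 + m2, length(y) == m1 + m2)
stopifnot(all(vapply(x, nrow, integer(1)) == N))
stopifnot(all(vapply(x, ncol, integer(1)) == p))

# Fit only if the package is available; lambda = 0.3 is an arbitrary choice
if (requireNamespace("FusionLearn", quietly = TRUE)) {
  fit <- FusionLearn::fusionmixed(x, y, lambda = 0.3, N = N, p = p,
                                  m1 = m1, m2 = m2)
  print(fit$beta)
}
```

Note that with `depen = "CORR"` in `fusionmixed.fit`, the sample sizes should be equal across platforms, so unequal values of `N` such as those above would only suit `depen = "IND"`.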

Details

fusionmixed handles a more complex data structure by aggregating information from platforms with continuous and binary responses. More details regarding the algorithm can be found in FusionLearn.
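
To make the group penalization more concrete, here is a minimal sketch of the two penalty functions named by `methods`, applied to the Euclidean norm of each predictor's coefficients across platforms. This is an illustration under stated assumptions, not the package's internal code. The function names are hypothetical, the parameter `a_scad` is set to the customary value 3.7 and is unrelated to the `a` argument of `fusionmixed.fit`, and grouping by row-wise norms is an assumption about how coefficients are grouped.

```r
# LASSO penalty: linear in the size of the (grouped) coefficient
lasso_penalty <- function(t, lambda) lambda * abs(t)

# SCAD penalty: linear near zero, quadratic in between, constant for large t
scad_penalty <- function(t, lambda, a_scad = 3.7) {
  t <- abs(t)
  ifelse(t <= lambda,
         lambda * t,
         ifelse(t <= a_scad * lambda,
                (2 * a_scad * lambda * t - t^2 - lambda^2) / (2 * (a_scad - 1)),
                lambda^2 * (a_scad + 1) / 2))
}

# Hypothetical coefficients: 3 predictors (rows) by 3 platforms (columns)
beta <- rbind(c(0.80,  1.10, 0.90),
              c(0.00,  0.00, 0.00),
              c(0.05, -0.02, 0.01))

# One norm per predictor, pooling its coefficients across platforms
group_norm <- sqrt(rowSums(beta^2))

total_lasso <- sum(lasso_penalty(group_norm, lambda = 0.5))
total_scad  <- sum(scad_penalty(group_norm, lambda = 0.5))
```

Because SCAD levels off for large norms, a strong predictor contributes at most a constant amount to the SCAD total, whereas its LASSO contribution keeps growing with the norm.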

Value

fusionmixed returns a list that has components:

`beta`	A matrix (nvars x nplatforms) containing estimated coefficients of each linear model. If some data sets do not have the complete set of predictors, the corresponding coefficients are output as `NA`.
`method`	Penalty function LASSO or SCAD.
`link`	The link function used in the estimation.
`threshold`	The difference in the estimates between successive updates upon convergence.
`iteration`	The number of iterations performed upon convergence.
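
As a rough illustration of reading the `beta` component, the sketch below uses a hand-made matrix in place of real output. The values and the row and column names are hypothetical. An `NA` marks a predictor that was not measured on a platform, as described for `Complete == FALSE`.

```r
# Hypothetical stand-in for result$beta: 4 predictors by 3 platforms
# (matrix() fills column by column, so each group of four is one platform)
beta <- matrix(c(0.9, 0.0, 0.4, NA,
                 1.2, 0.0, 0.3, 0.0,
                 0.7, 0.0, 0.5, 0.0),
               nrow = 4, ncol = 3,
               dimnames = list(paste0("M", 1:4), paste0("platform", 1:3)))

# A predictor counts as selected if any measured coefficient is nonzero
selected <- rowSums(beta != 0, na.rm = TRUE) > 0
names(which(selected))

# Locate predictor/platform pairs that were not measured
not_measured <- which(is.na(beta), arr.ind = TRUE)
not_measured
```

Passing `na.rm = TRUE` keeps an unmeasured predictor from turning a whole row's count into `NA`.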

fusionmixed.fit provides the results in a table:

`lambda`	The sequence of penalty values.
`BIC`	The pseudolikelihood Bayesian information criterion evaluated at the sequence of the penalty values.
`-2Loglkh`	Minus twice the pseudo loglikelihood of the chosen model.
`Est_df`	The estimated degrees of freedom quantifying the model complexity.

fusionmixed.fit also returns a model selection plot showing the results above.
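
Choosing a penalty value from this table can be sketched as follows. The table here is a mock with made-up numbers, and treating the result as a data frame is an assumption; positional indexing is used because it works for a matrix as well.

```r
# Mock table with the four documented columns (all values are invented)
model <- data.frame(
  lambda     = c(0.1, 0.5, 1.0, 2.0),
  BIC        = c(520.4, 498.7, 505.2, 540.9),
  `-2Loglkh` = c(430.1, 441.3, 470.8, 525.0),
  Est_df     = c(16.5, 10.5, 6.3, 2.9),
  check.names = FALSE
)

# Column 2 holds the BIC; take the row where it is smallest
best_row    <- which.min(model[, 2])
best_lambda <- model[best_row, 1]
best_lambda
```

If the smallest BIC sits at either end of the sequence, it may be worth widening the range of penalty values and refitting.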

Note

The range of the penalty values should be chosen carefully. For some penalty values, the resulting models may have a singular information matrix, or the glm fit may fail to converge.
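
One way to screen a range of penalty values is to fit them one at a time and skip those that fail. The sketch below shows the pattern with `tryCatch`. The function `fit_one` is a hypothetical stand-in that fails for small penalties; in practice it would wrap a call to `fusionmixed`. Whether the package signals such problems as errors or as warnings is not stated here, so treat the error handling as an assumption.

```r
# Hypothetical fitter: fails for very small penalties to mimic a bad fit
fit_one <- function(lambda) {
  if (lambda < 0.05) stop("singular information matrix")
  list(lambda = lambda)
}

lambdas <- c(0.01, 0.1, 1)

# Fit each penalty separately; record NULL when the fit throws an error
fits <- lapply(lambdas, function(l) {
  tryCatch(fit_one(l),
           error = function(e) {
             message("lambda = ", l, " skipped: ", conditionMessage(e))
             NULL
           })
})

# Keep only the penalty values whose fits succeeded
usable <- lambdas[!vapply(fits, is.null, logical(1))]
usable
```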

Author(s)

Xin Gao, Yuan Zhong, and Raymond J. Carroll

References

Gao, X. and Carroll, R. J. (2017). Data integration with high dimensionality. Biometrika, 104(2), 251-272.

Examples

library(FusionLearn)

## Analysis of the index data

# Responses are the indices "VIX", "GSPC", and "DJI";
# "DJI" is dichotomized into "increasing" or "decreasing"
y <- list(stockindexVIX[,1], stockindexGSPC[,1], stockindexDJI[,1] > 0)

# Predictors include 46 stocks
x <- list(stockindexVIX[,2:47], stockindexGSPC[,2:47], stockindexDJI[,2:47])

## Model selection based on the pseudolikelihood
## information criterion
model <- fusionmixed.fit(x, y, seq(0.03, 5, length.out = 10), 232, 46, 2, 1, depen = "CORR")
# Use the penalty value with the smallest BIC (column 2 of the table)
lambda <- model[which.min(model[,2]), 1]
result <- fusionmixed(x, y, lambda, 232, 46, 2, 1)

## Identify the significant predictors for the three indices
# +1 skips the response stored in column 1 of stockindexVIX
id <- which(result$beta[,1] != 0) + 1
colnames(stockindexVIX)[id]

[Package FusionLearn version 0.2.1 Index]