R: Robust gene-environment interaction analysis approach via...

Miss.boosting {GEInter}

R Documentation

Robust gene-environment interaction analysis approach via sparse boosting, where the missingness in environmental measurements is effectively accommodated using multiple imputation approach

Description

This gene-environment analysis approach includes three steps to accommodate both missingness in environmental (E) measurements and long-tailed or contaminated outcomes. At the first step, the multiple imputation approach based on sparse boosting method is developed to accommodate missingness in E measurements, where we use NA to represent those E measurments which are missing. Here a semiparametric model is assumed to accommodate nonlinear effects, where we model continuous E factors in a nonlinear way, and discrete E factors in a linear way. For estimating the nonlinear functions, the B spline expansion is adopted. At the second step, for each imputed data, we develop RobSBoosting approach for identifying important main E and genetic (G) effects, and G-E interactions, where the Huber loss function and Qn estimator are adopted to accommodate long-tailed distribution/data contamination (see RobSBoosting). At the third step, the identification results from Step 2 are combined based on stability selection technique.

Usage

Miss.boosting(
  G,
  E,
  Y,
  im_time = 10,
  loop_time = 500,
  num.knots = c(2),
  Boundary.knots,
  degree = c(2),
  v = 0.1,
  tau,
  family = c("continuous", "survival"),
  knots = NULL,
  E_type
)

Arguments

`G`	Input matrix of `p` genetic measurements consisting of `n` rows. Each row is an observation vector.
`E`	Input matrix of `q` environmental risk factors. Each row is an observation vector.
`Y`	Response variable. A quantitative vector for `family`="continuous". For `family`="survival", `Y` should be a two-column matrix with the first column being the log(survival time) and the second column being the censoring indicator. The indicator is a binary variable, with "1" indicating dead, and "0" indicating right censored.
`im_time`	Number of imputation for accommodating missingness in E variables.
`loop_time`	Number of iterations of the sparse boosting.
`num.knots`	Numbers of knots for the B spline basis.
`Boundary.knots`	The boundary of knots for the B spline basis.
`degree`	Degree for the B spline basis.
`v`	The step size used in the sparse boosting process. Default is 0.1.
`tau`	Threshold used in the stability selection at the third step.
`family`	Response type of `Y` (see above).
`knots`	List of knots for the B spline basis. Default is NULL and knots can be generated with the given `num.knots, degree` and `Boundary.knots`.
`E_type`	A vector indicating the type of each E factor, with "ED" representing discrete E factor, and "EC" representing continuous E factor.

Value

An object with S3 class "Miss.boosting" is returned, which is a list with the following components

`call`	The call that produced this object.
`alpha0`	A vector with each element indicating whether the corresponding E factor is selected.
`beta0`	A vector with each element indicating whether the corresponding G factor or G-E interaction is selected. The first element is the first G effect and the second to (`q+1`) elements are the interactions for the first G factor, and so on.
`intercept`	The intercept estimate.
`unique_variable`	A matrix with two columns that represents the variables that are selected for the model after removing the duplicates, since the `loop_time` iterations of the method may produce variables that are repeatedly selected into the model. Here, the first and second columns correspond to the indexes of E factors and G factors. For example, (1, 0) represents that this variable is the first E factor, and (1,2) represents that the variable is the interaction between the first E factor and second G factor.
`unique_coef`	Coefficients corresponding to `unique_variable`. Here, the coefficients are simple regression coefficients for the linear effect (discrete E factor, G factor, and their interaction), and B spline coefficients for the nonlinear effect (continuous E factor, and corresponding G-E interaction).
`unique_knots`	A list of knots corresponding to `unique_variable`. Here, when the type of `unique_variable` is discrete E factor, G factor or their interaction, knot will be NULL, and knots will be B spline otherwise.
`unique_Boundary.knots`	A list of boundary knots corresponding to `unique_variable`.
`unique_vtype`	A vector representing the variable type of `unique_variable`. Here, "EC" stands for continuous E effect, "ED" for discrete E effect, "G" for genetic factor variable, "EC-G" for the interaction between "EC" and "G", and "ED-G" for the interaction between "ED" and "G".
`degree`	Degree for the B spline basis.
`NorM`	The values of B spline basis.
`E_type`	The type of E effects.

References

Mengyun Wu and Shuangge Ma. Robust semiparametric gene-environment interaction analysis using sparse boosting. Statistics in Medicine, 38(23):4625-4641, 2019.

Examples

data(Rob_data)
G=Rob_data[,1:20];E=Rob_data[,21:24]
Y=Rob_data[,25];Y_s=Rob_data[,26:27]
knots=list();Boundary.knots=matrix(0,(20+4),2)
for (i in 1:4){
  knots[[i]]=c(0,1)
  Boundary.knots[i,]=c(0,1)
}
E2=E1=E

##continuous
E1[7,1]=NA
fit1<-Miss.boosting(G,E1,Y,im_time=1,loop_time=100,num.knots=c(2),Boundary.knots,
degree=c(2),v=0.1,tau=0.3,family="continuous",knots=knots,E_type=c("EC","EC","ED","ED"))
y1_hat=predict(fit1,matrix(E1[1,],nrow=1),matrix(G[1,],nrow=1))
plot(fit1)


##survival
E2[4,1]=NA
fit2<-Miss.boosting(G,E2,Y_s,im_time=2,loop_time=200,num.knots=c(2),Boundary.knots,
degree=c(2),v=0.1,tau=0.3,family="survival",knots,E_type=c("EC","EC","ED","ED"))
y2_hat=predict(fit2,matrix(E1[1,],nrow=1),matrix(G[1,],nrow=1))
plot(fit2)

[Package GEInter version 0.3.2 Index]