Augmented.data {GEInter} | R Documentation |
Accommodating missingness in environmental measurements in gene-environment interaction analysis
Description
We consider the scenario with missingness in environmental (E) measurements. Our approach
consists of two steps. First, a nonparametric kernel-based data augmentation
approach accommodates the missingness. Second, the penalization approach BLMCP
performs regularized estimation and selects important interactions and main genetic (G) effects
while respecting the "main effects-interactions" hierarchical structure.
Because E variables are usually preselected and low-dimensional, no selection is conducted on
them. As a byproduct of the weighting scheme, the proposed approach also enjoys a certain
robustness property.
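To make the idea of kernel-based weights over mixed E factors concrete, here is a minimal sketch in base R. It is not the GEInter implementation: the function name mixed_kernel_weights, the match/mismatch kernel for discrete factors, the Gaussian kernel for continuous factors, and the rule of skipping missing components are all assumptions chosen for illustration. The package's actual weighting scheme may differ.

```r
# Illustrative only: NOT the GEInter implementation.
# Product kernel over mixed discrete ("ED") and continuous ("EC") E factors.
# h_d: bandwidth for discrete factors (assumed to lie in [0, 1])
# h_c: bandwidth for continuous factors (assumed to be > 0)
mixed_kernel_weights <- function(e0, E_obs, E_type, h_d = 0.5, h_c = 1) {
  stopifnot(length(e0) == ncol(E_obs), length(E_type) == ncol(E_obs))
  w <- rep(1, nrow(E_obs))
  for (j in seq_len(ncol(E_obs))) {
    # Assumption: components missing in the target record are skipped
    if (is.na(e0[j])) next
    if (E_type[j] == "ED") {
      # Discrete kernel: 1 on a match, h_d on a mismatch
      w <- w * ifelse(E_obs[, j] == e0[j], 1, h_d)
    } else {
      # Gaussian kernel on the bandwidth-scaled difference
      w <- w * dnorm((E_obs[, j] - e0[j]) / h_c)
    }
  }
  w / sum(w)  # normalize so the weights sum to one
}

set.seed(1)
E_obs <- cbind(rnorm(6), rbinom(6, 1, 0.5))  # fully observed donor records
w_full <- mixed_kernel_weights(c(0, 1), E_obs, E_type = c("EC", "ED"))
w_miss <- mixed_kernel_weights(c(NA, 1), E_obs, E_type = c("EC", "ED"))
```

In this sketch, w_miss depends only on the discrete factor because the continuous component of the target record is missing, so donor records that agree on the discrete factor receive equal weight.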
Usage
Augmented.data(G, E, Y, h, family = c("continuous", "survival"), E_type)
Arguments
G |
Input matrix of |
E |
Input matrix of |
Y |
Response variable. A quantitative vector for |
h |
The bandwidths of the kernel functions with the first and second elements corresponding to the discrete and continuous E factors. |
family |
Response type of |
E_type |
A vector indicating the type of each E factor, with "ED" representing discrete E factor, and "EC" representing continuous E factor. |
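Since E_type must have one entry per column of E, it can be convenient to derive it from the data rather than type it by hand. The helper below is hypothetical (guess_E_type is not part of GEInter) and rests on a simple assumption: a column with at most max_levels distinct observed values is treated as discrete. Review the result before passing it on, since a coarsely measured continuous factor could be mislabeled.

```r
# Hypothetical helper (not part of GEInter): label a column "ED" when it has
# at most `max_levels` distinct observed values, and "EC" otherwise.
guess_E_type <- function(E, max_levels = 2) {
  n_levels <- apply(E, 2, function(col) length(unique(col[!is.na(col)])))
  ifelse(n_levels <= max_levels, "ED", "EC")
}

set.seed(100)
E <- matrix(rnorm(100 * 5), 100, 5)
E[, 2] <- E[, 2] > 0   # logical values are coerced to 0/1 in a numeric matrix
E[, 3] <- E[, 3] > 0
E[c(2, 6, 8, 15), 1] <- NA  # missing values are ignored when counting levels

E_type <- guess_E_type(E)
stopifnot(length(E_type) == ncol(E))
```

For this simulated E, the helper labels the second and third factors as discrete and the rest as continuous.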
Value
E_w |
The augmented data corresponding to |
G_w |
The augmented data corresponding to |
y_w |
The augmented data corresponding to response |
weight |
The weights of the augmented observation data for accommodating missingness and also
right censoring if |
References
Mengyun Wu, Yangguang Zang, Sanguo Zhang, Jian Huang, and Shuangge Ma.
Accommodating missingness in environmental measurements in gene-environment interaction
analysis. Genetic Epidemiology, 41(6):523-554, 2017.
Jin Liu, Jian Huang, Yawei Zhang, Qing Lan, Nathaniel Rothman, Tongzhang Zheng, and Shuangge Ma.
Identification of gene-environment interactions in cancer studies using penalization.
Genomics, 102(4):189-194, 2013.
Examples
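library(GEInter)  # loads Augmented.data(), BLMCP() and the helpers used below
set.seed(100)
# simulate 100 subjects with 50 correlated G factors and 5 E factors
sigmaG=AR(0.3,50)
G=MASS::mvrnorm(100,rep(0,50),sigmaG)
E=matrix(rnorm(100*5),100,5)
# dichotomize the 2nd and 3rd E factors (stored as 0/1)
E[,2]=E[,2]>0
E[,3]=E[,3]>0
# true coefficients: alpha for the E factors, beta for the 50 G factors
alpha=runif(5,2,3)
beta=matrix(0,5+1,50)
beta[1,1:7]=runif(7,2,3)
beta[2:4,1]=runif(3,2,3)
beta[2:3,2]=runif(2,2,3)
beta[5,3]=runif(1,2,3)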
# continuous response with Normal errors (mean 0, standard deviation 4)
y1=simulated_data(G=G,E=E,alpha=alpha,beta=beta,error=rnorm(100,0,4),family="continuous")
# survival response with Normal errors (mean 0, standard deviation 1)
y2=simulated_data(G,E,alpha,beta,rnorm(100,0,1),family="survival",0.7,0.9)
# generate E measurements with missingness in the first E factor
miss_label1=c(2,6,8,15)
miss_label2=c(4,6,8,16)
E1=E2=E
E1[miss_label1,1]=NA
E2[miss_label2,1]=NA
# continuous: augment the data, then fit BLMCP on the weighted augmented data
data_new1<-Augmented.data(G,E1,y1,h=c(0.5,1), family="continuous",
                          E_type=c("EC","ED","ED","EC","EC"))
fit1<-BLMCP(data_new1$G_w, data_new1$E_w, data_new1$y_w, data_new1$weight,
            lambda1=0.025,lambda2=0.06,gamma1=3,gamma2=3,max_iter=200)
coef1=coef(fit1)
# predict for the first two subjects using the complete E matrix
y1_hat=predict(fit1,E[c(1,2),],G[c(1,2),])
plot(fit1)
# survival: same two steps, with weights that also account for right censoring
data_new2<-Augmented.data(G,E2,y2, h=c(0.5,1), family="survival",
                          E_type=c("EC","ED","ED","EC","EC"))
fit2<-BLMCP(data_new2$G_w, data_new2$E_w, data_new2$y_w, data_new2$weight,
            lambda1=0.04,lambda2=0.05,gamma1=3,gamma2=3,max_iter=200)
coef2=coef(fit2)
# predict for the first two subjects using the complete E matrix
y2_hat=predict(fit2,E[c(1,2),],G[c(1,2),])
plot(fit2)
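Before augmenting, it can help to confirm where the missing E values actually are. The sketch below uses only base R and rebuilds a matrix like E1 from the example above so that it runs on its own. The names miss_per_col and miss_rows are introduced here for illustration.

```r
# Rebuild an E matrix with missingness in the first factor, as in the example
set.seed(100)
E <- matrix(rnorm(100 * 5), 100, 5)
miss_label1 <- c(2, 6, 8, 15)
E1 <- E
E1[miss_label1, 1] <- NA

# Number of missing values per E factor
miss_per_col <- colSums(is.na(E1))

# Indices of subjects with at least one missing E measurement
miss_rows <- which(!complete.cases(E1))

# Every other subject has a fully observed E record
n_complete <- sum(complete.cases(E1))
```

Here miss_rows recovers miss_label1, and only the first factor has missing entries, which matches how E1 was constructed.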