R: Cross-validation for GEInfo

cv.GEInfo {GEInfo}

R Documentation

Cross-validation for GEInfo

Description

Performs k-fold cross-validation for the GEInfo approach, which adaptively accommodates the quality of the prior information and automatically detects false information. Tuning parameters are chosen based on a user-specified criterion.

Usage

cv.GEInfo(
  E,
  G,
  Y,
  family,
  S_G,
  S_GE,
  nfolds = 3,
  xi = 6,
  epsilon = 0,
  max.it = 500,
  thresh = 0.001,
  criterion = "BIC",
  Type_Y = NULL,
  kappa1 = NULL,
  kappa2 = NULL,
  lam1 = NULL,
  lam2 = NULL,
  tau = c(0, 0.25, 0.5, 0.75, 1)
)

Arguments

`E`	Observed matrix of E variables, of dimensions n x q.
`G`	Observed matrix of G variables, of dimensions n x p.
`Y`	Response variable, of length n. Quantitative for family="gaussian"; non-negative counts for family="poisson". For family="binomial", Y should be a factor with two levels.
`family`	Model type: one of ("gaussian", "binomial", "poisson").
`S_G`	A user-supplied vector giving the indices of the G variables that have prior information.
`S_GE`	A user-supplied matrix giving the indices of the G-E interactions that have prior information. The first and second columns of S_GE hold the index of the G variable and the index of the E variable, respectively. For example, S_GE = matrix(c(1, 2), ncol = 2) indicates that the 1st G variable and the 2nd E variable have an interaction effect on Y.
`nfolds`	Number of folds. Default is 3. Although nfolds can be as large as the sample size n (leave-one-out CV), this is not recommended for large datasets. The smallest allowable value is nfolds=3.
`xi`	Tuning parameter of MCP penalty. Default is 6.
`epsilon`	Tuning parameter of the ridge penalty, which shrinks the coefficients that have prior information. Default is 0.
`max.it`	Maximum number of iterations (total across entire path). Default is 500.
`thresh`	Convergence threshold for group coordinate descent algorithm. The algorithm iterates until the change for each coefficient is less than thresh. Default is 1e-3.
`criterion`	Criterion used for tuning selection via cross-validation. Currently five options: MSE, AIC, BIC, EBIC, GCV. Default is BIC. See Details.
`Type_Y`	A vector of Type_Y prior information, of the same length as Y. Default is NULL. For family="gaussian", Type_Y is continuous. For family="binomial", Type_Y is binary. For family="poisson", Type_Y is a count. If a Type_Y vector is supplied, the function uses it to estimate a GEInfo model. If Type_Y=NULL, the function incorporates the prior information contained in S_G and S_GE to fit a GEInfo model.
`kappa1`	A user supplied kappa1 sequence. Default is kappa1=NULL. Typical usage is to have the program compute its own kappa1 sequence. Supplying a value of kappa1 overrides this. See Details.
`kappa2`	A user supplied kappa2 sequence. Default is kappa2=NULL. Typical usage is to have the program compute its own kappa2 sequence. Supplying a value of kappa2 overrides this. See Details.
`lam1`	A user supplied lambda1 sequence. Default is lam1=NULL. Typical usage is to have the program compute its own lambda1 sequence. Supplying a value of lam1 overrides this. See Details.
`lam2`	A user supplied lambda2 sequence. Default is lam2=NULL. Typical usage is to have the program compute its own lambda2 sequence. Supplying a value of lam2 overrides this. See Details.
`tau`	A user supplied tau sequence ranging from 0 to 1. Default is tau = c(0, 0.25, 0.5, 0.75, 1). See Details.
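
As a hedged illustration of the index arguments S_G and S_GE, the sketch below builds them in base R for a made-up set of prior information. The dimensions, the chosen indices, and the helper check_prior_index are assumptions for illustration only and are not part of GEInfo.

```r
# Hypothetical dimensions: p G variables and q E variables.
p <- 4
q <- 2

# Prior information on G main effects: the 1st and 3rd G variables.
S_G <- c(1, 3)

# Prior information on G-E interactions, one row per interaction:
# column 1 = index of the G variable, column 2 = index of the E variable.
S_GE <- rbind(c(1, 2),   # G1 x E2
              c(3, 1))   # G3 x E1

# Illustrative helper (not part of GEInfo): check that all indices are in range.
check_prior_index <- function(S_G, S_GE, p, q) {
  all(S_G >= 1, S_G <= p) &&
    ncol(S_GE) == 2 &&
    all(S_GE[, 1] >= 1, S_GE[, 1] <= p) &&
    all(S_GE[, 2] >= 1, S_GE[, 2] <= q)
}

check_prior_index(S_G, S_GE, p, q)
```

Using rbind with one row per interaction keeps the two-column layout described for S_GE even when only a single interaction is supplied.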

Details

The function involves five tuning parameters, namely kappa1, kappa2, lambda1, lambda2, and tau. kappa1 and kappa2 are used to estimate the model and select variables. lambda1 and lambda2 are used to calculate the prior-predicted response based on S_G and S_GE. tau balances the observed response Y against the prior-predicted response. When tau=0 or tau=1, this function performs cross-validation for the GEsgMCP or CGEInfo approach, respectively.
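
To make the role of tau more concrete, here is a minimal sketch under one explicit assumption: that the balancing behaves like a convex combination of the observed response and the prior-predicted response. This weighting form and the names blend_response, Y_obs, and Y_prior are illustrative assumptions, not the package's internal code. The form is chosen only so that tau=0 leaves the observed response unchanged and tau=1 relies fully on the prior-predicted response.

```r
# Illustrative only: an assumed convex-combination view of tau.
blend_response <- function(Y_obs, Y_prior, tau) {
  stopifnot(length(Y_obs) == length(Y_prior), tau >= 0, tau <= 1)
  (1 - tau) * Y_obs + tau * Y_prior
}

Y_obs   <- c(1.0, 2.0, 3.0)   # made-up observed responses
Y_prior <- c(2.0, 2.0, 2.0)   # made-up prior-predicted responses

# One column per value in the default tau grid.
blended <- sapply(c(0, 0.25, 0.5, 0.75, 1),
                  function(tau) blend_response(Y_obs, Y_prior, tau))
blended
```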

To select the optimal combination of tuning parameters, five criteria are available: MSE, AIC, BIC, GCV, and EBIC. Let L be the loss function of the model. Then MSE=L, AIC=2L+2df, BIC=2L+ln(n)df, GCV=2L/(1-df/n)^2, and EBIC=2L+ln(n)df + 2df ln(nvar) (1-ln(n)/(2ln(nvar))). In most cases, BIC is a good choice. In high-dimensional settings, EBIC is the recommended first choice, as it has demonstrated satisfactory performance in high-dimensional studies.
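
The formulas above can be transcribed directly into base R. The sketch below is not the package's internal code. The function name tuning_criteria and the input values are hypothetical, and L, df, n, and nvar are simply passed in as numbers.

```r
# Direct transcription of the five criteria stated in Details.
# L: loss value, df: degrees of freedom, n: sample size,
# nvar: the quantity called nvar in the EBIC formula.
tuning_criteria <- function(L, df, n, nvar) {
  c(
    MSE  = L,
    AIC  = 2 * L + 2 * df,
    BIC  = 2 * L + log(n) * df,
    GCV  = 2 * L / (1 - df / n)^2,
    EBIC = 2 * L + log(n) * df +
      2 * df * log(nvar) * (1 - log(n) / (2 * log(nvar)))
  )
}

# Made-up inputs for illustration.
crit <- tuning_criteria(L = 10, df = 3, n = 100, nvar = 20)
crit
```

Within a single criterion, the candidate tuning combination with the smallest value would be preferred. Values of different criteria are not comparable with each other.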

Value

An object of class "GEInfo" is returned, which is a list with the ingredients of the cross-validation fit.

`coef.all.tau`	A matrix of coefficients, of dimensions (p+1)(q+1) x length(tau).
`best.tuning`	A list containing the optimal tau, kappa1, and kappa2.
`a`	Coefficient vector of length q for E variables.
`beta`	Coefficient vector of length p for G variables.
`gamma`	Coefficient matrix of dimensions p x q for G-E interactions.
`b`	Coefficient vector of length (q+1)p for W (G variables and G-E interactions).
`alpha`	Intercept.
`coef`	A coefficient vector of length (q+1)(p+1), including the estimates for `alpha` (intercept), `a` (coefficients for all E variables), and `b` (coefficients for all G variables and G-E interactions).
`nvar`	Number of non-zero coefficients at the best tunings.
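
The relationship between `coef`, `alpha`, `a`, `b`, `beta`, and `gamma` can be sketched in base R. The ordering used here (intercept first, then `a`, then `b`, with `b` interleaving each G effect and its interactions as in the Examples section) is an assumption inferred from this page, and split_coef is a hypothetical helper, not a GEInfo function.

```r
p <- 4; q <- 2

# Made-up coefficient values with the documented shapes.
alpha <- 0
a     <- c(0.4, 0.6)                       # length q
beta  <- c(0.2, 0.5, 0, 0)                 # length p
gamma <- matrix(c(0.8, 0.9, 0, 0, 0, 0, 0, 0),
                nrow = p, byrow = TRUE)    # p x q

# b interleaves beta_j with the j-th row of gamma: beta_1, gamma_11, gamma_12, ...
b    <- as.vector(t(cbind(beta, gamma)))   # length (q + 1) * p
coef <- c(alpha, a, b)                     # length (q + 1) * (p + 1)

# Hypothetical helper: recover the pieces from a stacked coefficient vector.
split_coef <- function(coef, p, q) {
  b   <- coef[(q + 2):length(coef)]
  mat <- matrix(b, nrow = p, byrow = TRUE) # row j = (beta_j, gamma_j1, ..., gamma_jq)
  list(alpha = coef[1],
       a     = coef[2:(q + 1)],
       b     = b,
       beta  = mat[, 1],
       gamma = mat[, -1, drop = FALSE])
}

parts <- split_coef(coef, p, q)
length(coef) == (p + 1) * (q + 1)
```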

References

Wang X, Xu Y, and Ma S. (2019). Identifying gene-environment interactions incorporating prior information. Statistics in Medicine, 38(9): 1620-1633. doi: 10.1002/sim.8064

Examples

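library(GEInfo)
set.seed(1)  # make the simulated data reproducible
# Simulate n observations with p G variables and q E variables
n <- 30; p <- 4; q <- 2
E <- MASS::mvrnorm(n, rep(0, q), diag(q))
G <- MASS::mvrnorm(n, rep(0, p), diag(p))
W <- matW(E, G)  # matrix of G variables and G-E interactions
# True coefficients: intercept, E effects, G effects, and G-E interactions
alpha <- 0; a <- seq(0.4, 0.6, length.out = q)
beta <- c(seq(0.2, 0.5, length.out = 2), rep(0, p - 2))
vector.gamma <- c(0.8, 0.9, 0, 0)
gamma <- matrix(c(vector.gamma, rep(0, p*q - length(vector.gamma))), nrow = p, byrow = TRUE)
mat.b.gamma <- cbind(beta, gamma)
b <- as.vector(t(mat.b.gamma))
Y <- alpha + E %*% a + W %*% b + rnorm(n, 0, 0.5)
# Prior information: the 1st G variable, and the interaction of the 1st G and 1st E variables
S_G <- c(1)
S_GE <- cbind(c(1), c(1))
# Supply single tuning values instead of letting the function compute its own sequences
fit4 <- cv.GEInfo(E, G, Y, family = "gaussian", S_G = S_G, S_GE = S_GE,
                  lam1 = 0.4, lam2 = 0.4, kappa1 = 0.4, kappa2 = 0.4, tau = 0.5)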
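
For readers who want to see what a matrix of G variables and G-E interactions might look like, here is a base R sketch. It is not the package's matW. The function build_W is hypothetical, and its column order (each G variable followed by its products with every E variable) is an assumption chosen to match how `b` is assembled from `beta` and `gamma` in the example above.

```r
# Hypothetical stand-in for a G / G-E design matrix (not GEInfo::matW).
# For each G variable j, the block is: G_j, G_j * E_1, ..., G_j * E_q.
build_W <- function(E, G) {
  blocks <- lapply(seq_len(ncol(G)), function(j) {
    cbind(G[, j], G[, j] * E)   # G[, j] is recycled down each column of E
  })
  do.call(cbind, blocks)
}

set.seed(1)
E_demo <- matrix(rnorm(6 * 2), nrow = 6)   # n = 6, q = 2
G_demo <- matrix(rnorm(6 * 3), nrow = 6)   # n = 6, p = 3
W_demo <- build_W(E_demo, G_demo)
dim(W_demo)                                # n x p(q + 1)
```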

[Package GEInfo version 1.0 Index]