R: Sparse Genomic Prediction

5. Sparse Genomic Prediction (SGP) {SFSI}

R Documentation

Sparse Genomic Prediction

Description

Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.

An optimal penalization can be chosen using cross-validation (CV) within a specific training set.

Usage

SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL,
    trn = NULL, tst = NULL, varU = NULL, varE = NULL,
    ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
    alpha = 1, lambda = NULL, nlambda = 100,
    lambda.min = .Machine$double.eps^0.5,
    common.lambda = TRUE, subset = NULL, tol = 1E-4,
    maxiter = 500, method = c("REML","ML"), tag = NULL,
    save.at = NULL, precision.format = c("single","double"),
    mc.cores = 1L, verbose = 2)
    
SGP.CV(y, X = NULL, b = NULL, Z = NULL, K,
       trn = NULL, varU = NULL, varE = NULL,
       ID_geno = NULL, ID_trait = NULL,
       intercept = TRUE, alpha = 1, lambda = NULL,
       nlambda = 100, lambda.min = .Machine$double.eps^0.5,
       common.lambda = TRUE, nfolds = 5, nCV = 1L,
       folds = NULL, seed = NULL, subset = NULL, tol = 1E-4,
       maxiter = 500, method = c("REML","ML"), tag = NULL,
       save.at = NULL, mc.cores = 1L, verbose = TRUE)

Arguments

`y`	(numeric vector) Response variable. It can be a matrix with each column representing a different response variable. If it is passed to the 'SGP' function, predicted values for testing data are computed using phenotypes from training data
`X`	(numeric matrix) Design matrix for fixed effects. When `X = NULL`, a vector of ones is constructed only for the intercept (default)
`b`	(numeric vector) Fixed effects. When `b = NULL`, only the intercept is estimated from training data using generalized least squares (default)
`K`	(numeric matrix) Kinship relationship matrix
`Z`	(numeric matrix) Design matrix for random effects. When `Z = NULL` an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used
`varU`, `varE`	(numeric) Genetic and residual variances. When either `varU = NULL` or `varE = NULL` (default), they are calculated from training data using the function 'fitBLUP' (see `help(fitBLUP)`) for single-trait analysis or using the function 'getGenCov' for multi-trait analysis
`ID_geno`	(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector `y` to rows/columns of matrix `G`
`ID_trait`	(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector `y` to rows/columns of matrices `varU` and `varE`
`intercept`	`TRUE` or `FALSE` to whether fit an intercept. When `FALSE`, the model assumes a null intercept
`trn`, `tst`	(integer vector) Which elements from vector `y` are in training and testing sets, respectively. When both `trn = NULL` and `tst = NULL`, non-NA entries in vector `y` will be used as training set
`subset`	(integer vector) `c(m,M)` to fit the model only for the m^th subset out of `M` subsets that the testing set will be divided into. Results can be automatically saved as per the `save.at` argument and can be retrieved later using function 'read_SGP' (see `help(read_SGP)`). In cross-validation, it should has format `c(fold,CV)` to fit the model for a given fold within partition
`alpha`	(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties
`lambda`	(numeric vector) Penalization parameter sequence. Default is `lambda = NULL`, in this case a decreasing grid of `nlambda` lambdas will be generated starting from a maximum equal to max(abs(G[trn,tst])/alpha) to a minimum equal to zero. If `alpha = 0` the grid is generated starting from a maximum equal to 5
`nlambda`	(integer) Number of lambdas generated when `lambda = NULL`
`lambda.min`	(numeric) Minimum value of lambda in the generated grid when `lambda = NULL`
`nfolds`	(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlapping folds in which the data is split into to do cross-validation. When `nfolds = 'n'` a leave-one-out CV is performed
`seed`	(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed
`nCV`	(integer) Number of CV repetitions to be performed. Default is `nCV = 1`
`folds`	(integer matrix) A matrix with `nTRN` rows and `nCV` columns where each column represents a different partition with `nfolds` folds. It can be created using the function 'get_folds'
`common.lambda`	`TRUE` or `FALSE` to whether computing the coefficients for a grid of lambdas common to all individuals in testing set or for a grid of lambdas specific to each individual in testing set. Default is `common.lambda = TRUE`
`mc.cores`	(integer) Number of cores used. When `mc.cores` > 1, the analysis is run in parallel for each testing set individual. Default is `mc.cores = 1`
`tol`	(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence
`maxiter`	(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached
`save.at`	(character) Path where files (regression coefficients and output object) are to be saved (this may include a prefix added to the files). Default `save.at = NULL` will no save any results and they are returned in the output object. No regression coefficients are saved for function 'SGP.CV'
`precision.format`	(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when `save.at` is not `NULL`
`method`	(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP'
`tag`	(character) Name given to the output for tagging purposes. Default `tag = NULL` will give the name of the method used
`verbose`	(integer) If greater than zero analysis details will be printed

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ²_uK), and environmental terms are assumed e ~ N(0,σ²_eI).

The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ²_uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The values u_tst = {u_i}, i = 1,2,...,n_tst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as

u_i = β'_i (y_trn - X_trnb)

where β_i is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function

-σ²_uG'_trn,tst(i) β_i + 1/2 β'_i(σ²_uG_trn + σ²_eI)β_i + λ J(β_i)

where G_trn,tst(i) is the i^th column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; G_trn is the sub-matrix corresponding to the training set; λ is the penalization parameter; and J(β_i) is a penalty function given by

1/2(1-α)||β_i||₂² + α||β_i||₁

where 0 ≤ α ≤ 1, and ||β_i||₁ = ∑_j=1|β_ij| and ||β_i||₂² = ∑_j=1β_ij² are the L1 and (squared) L2-norms, respectively.

Function 'SGP' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)) by setting the argument Sigma equal to σ²_uG_trn + σ²_eI and Gamma equal to σ²_uG_trn,tst(i).

Function 'SGP.CV' performs cross-validation within the training data specified in argument trn. Training data is divided into k folds and the SGP is sequentially derived for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).

Value

Function 'SGP' returns a list object of the class 'SGP' for which methods coef, predict, plot, and summary exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:

b: (vector) fixed effects solutions (including the intercept).
Xb: (vector) total fixed values (X b).
u: (matrix) total genetic values (u = Z g) for testing individuals (in rows) associated to each value of lambda (in columns).
varU, varE, h2: variance components solutions.
alpha: value for the elastic-net weights used.
lambda: (matrix) sequence of values of lambda used (in columns) for each testing individual (in rows).
nsup: (matrix) number of non-zero predictors at each solution given by lambda for each testing individual (in rows).
file_beta: path where regression coefficients are saved.

Function 'SGP.CV' returns a list object of length nCV of the class 'SGP'. Optimal cross-validated penalization values can be obtained using thesummary method. Method plot is also available.

Examples

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:12)    # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  Y0 = scale(as.matrix(Y[,4:6]))      # Response variable
  
  #---------------------------------------------------
  # Single-trait model
  #---------------------------------------------------
  y = Y0[,1]    
  
  # Training and testing sets
  tst = which(Y$trial %in% 2:3)
  trn = seq_along(y)[-tst]
  
  # Sparse genomic prediction
  fm1 = SGP(y, K=G, trn=trn, tst=tst)
  
  uHat = predict(fm1)        # Predicted values for each testing element
  out = summary(fm1)         # Useful function to get results
  corTST = out$accuracy      # Testing set accuracy (correlation cor(y,yHat))
  out$optCOR                 # SGP with maximum accuracy
  B = coef(fm1)              # Regression coefficients for all tst
  B = coef(fm1, iy=1)        # Coefficients for first tst (tst[1])
  B = coef(fm1, ilambda=10)  # Coefficients associated to the 10th lambda
  B = coef(fm1, nsup=10)     # Coefficients for which nsup=10
  
  plot(fm1)                  # Penalization vs accuracy plot
  plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity')     
  
  varU = fm1$varU
  varE = fm1$varE
  b = fm1$b
  
  #---------------------------------------------------
  # Predicting a testing set using a value of lambda
  # obtained from cross-validation in a traning set
  #---------------------------------------------------
  # Run a cross validation in training set
  fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV")
  lambda = summary(fm2)$optCOR["lambda"]

  # Fit the index with the obtained lambda
  fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda)
  summary(fm3)$accuracy        # Testing set accuracy

  # Compare the accuracy with that of the non-sparse index (G-BLUP)
  summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one
  
  #---------------------------------------------------
  # Multi-trait SGP
  #---------------------------------------------------
  ID_geno = as.vector(row(Y0))
  ID_trait = as.vector(col(Y0))
  y = as.vector(Y0)
  
  # Training and testing sets
  tst = c(which(ID_trait==1)[Y$trial %in% 2:3],
          which(ID_trait==2)[Y$trial %in% 2],
          which(ID_trait==3)[Y$trial %in% 3])
  trn = seq_along(y)[-tst]
  
  fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait,
             trn=trn, tst=tst)
  
  multitrait.plot(fmMT)

[Package SFSI version 1.4 Index]