5. Sparse Genomic Prediction (SGP) {SFSI}R Documentation

Sparse Genomic Prediction

Description

Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.

An optimal penalization can be chosen using cross-validation (CV) within a specific training set.

Usage

SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL,
    trn = NULL, tst = NULL, varU = NULL, varE = NULL,
    ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
    alpha = 1, lambda = NULL, nlambda = 100,
    lambda.min = .Machine$double.eps^0.5,
    common.lambda = TRUE, subset = NULL, tol = 1E-4,
    maxiter = 500, method = c("REML","ML"), tag = NULL,
    save.at = NULL, precision.format = c("single","double"),
    mc.cores = 1L, verbose = 2)
    
SGP.CV(y, X = NULL, b = NULL, Z = NULL, K,
       trn = NULL, varU = NULL, varE = NULL,
       ID_geno = NULL, ID_trait = NULL,
       intercept = TRUE, alpha = 1, lambda = NULL,
       nlambda = 100, lambda.min = .Machine$double.eps^0.5,
       common.lambda = TRUE, nfolds = 5, nCV = 1L,
       folds = NULL, seed = NULL, subset = NULL, tol = 1E-4,
       maxiter = 500, method = c("REML","ML"), tag = NULL,
       save.at = NULL, mc.cores = 1L, verbose = TRUE)

Arguments

y

(numeric vector) Response variable. It can be a matrix with each column representing a different response variable. If it is passed to the 'SGP' function, predicted values for testing data are computed using phenotypes from training data

X

(numeric matrix) Design matrix for fixed effects. When X = NULL, a vector of ones is constructed only for the intercept (default)

b

(numeric vector) Fixed effects. When b = NULL, only the intercept is estimated from training data using generalized least squares (default)

K

(numeric matrix) Kinship relationship matrix

Z

(numeric matrix) Design matrix for random effects. When Z = NULL an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used

varU, varE

(numeric) Genetic and residual variances. When either varU = NULL or varE = NULL (default), they are calculated from training data using the function 'fitBLUP' (see help(fitBLUP)) for single-trait analysis or using the function 'getGenCov' for multi-trait analysis

ID_geno

(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector y to rows/columns of matrix G

ID_trait

(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector y to rows/columns of matrices varU and varE

intercept

TRUE or FALSE to whether fit an intercept. When FALSE, the model assumes a null intercept

trn, tst

(integer vector) Which elements from vector y are in training and testing sets, respectively. When both trn = NULL and tst = NULL, non-NA entries in vector y will be used as training set

subset

(integer vector) c(m,M) to fit the model only for the mth subset out of M subsets that the testing set will be divided into. Results can be automatically saved as per the save.at argument and can be retrieved later using function 'read_SGP' (see help(read_SGP)). In cross-validation, it should has format c(fold,CV) to fit the model for a given fold within partition

alpha

(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties

lambda

(numeric vector) Penalization parameter sequence. Default is lambda = NULL, in this case a decreasing grid of nlambda lambdas will be generated starting from a maximum equal to

max(abs(G[trn,tst])/alpha)

to a minimum equal to zero. If alpha = 0 the grid is generated starting from a maximum equal to 5

nlambda

(integer) Number of lambdas generated when lambda = NULL

lambda.min

(numeric) Minimum value of lambda in the generated grid when lambda = NULL

nfolds

(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlapping folds in which the data is split into to do cross-validation. When nfolds = 'n' a leave-one-out CV is performed

seed

(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed

nCV

(integer) Number of CV repetitions to be performed. Default is nCV = 1

folds

(integer matrix) A matrix with nTRN rows and nCV columns where each column represents a different partition with nfolds folds. It can be created using the function 'get_folds'

common.lambda

TRUE or FALSE to whether computing the coefficients for a grid of lambdas common to all individuals in testing set or for a grid of lambdas specific to each individual in testing set. Default is common.lambda = TRUE

mc.cores

(integer) Number of cores used. When mc.cores > 1, the analysis is run in parallel for each testing set individual. Default is mc.cores = 1

tol

(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence

maxiter

(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached

save.at

(character) Path where files (regression coefficients and output object) are to be saved (this may include a prefix added to the files). Default save.at = NULL will no save any results and they are returned in the output object. No regression coefficients are saved for function 'SGP.CV'

precision.format

(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when save.at is not NULL

method

(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP'

tag

(character) Name given to the output for tagging purposes. Default tag = NULL will give the name of the method used

verbose

(integer) If greater than zero analysis details will be printed

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(02uK), and environmental terms are assumed e ~ N(02eI).

The resulting vector of genetic values u = Z g will therefore follow u ~ N(02uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The values utst = {ui}, i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as

ui = β'i (ytrn - Xtrnb)

where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function

2uG'trn,tst(i) βi + 1/2 β'i2uGtrn + σ2eI)βi + λ J(βi)

where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; Gtrn is the sub-matrix corresponding to the training set; λ is the penalization parameter; and J(βi) is a penalty function given by

1/2(1-α)||βi||22 + α||βi||1

where 0 ≤ α ≤ 1, and ||βi||1 = ∑j=1ij| and ||βi||22 = ∑j=1βij2 are the L1 and (squared) L2-norms, respectively.

Function 'SGP' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)) by setting the argument Sigma equal to σ2uGtrn + σ2eI and Gamma equal to σ2uGtrn,tst(i).

Function 'SGP.CV' performs cross-validation within the training data specified in argument trn. Training data is divided into k folds and the SGP is sequentially derived for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).

Value

Function 'SGP' returns a list object of the class 'SGP' for which methods coef, predict, plot, and summary exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:

Function 'SGP.CV' returns a list object of length nCV of the class 'SGP'. Optimal cross-validated penalization values can be obtained using thesummary method. Method plot is also available.

Examples

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:12)    # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  Y0 = scale(as.matrix(Y[,4:6]))      # Response variable
  
  #---------------------------------------------------
  # Single-trait model
  #---------------------------------------------------
  y = Y0[,1]    
  
  # Training and testing sets
  tst = which(Y$trial %in% 2:3)
  trn = seq_along(y)[-tst]
  
  # Sparse genomic prediction
  fm1 = SGP(y, K=G, trn=trn, tst=tst)
  
  uHat = predict(fm1)        # Predicted values for each testing element
  out = summary(fm1)         # Useful function to get results
  corTST = out$accuracy      # Testing set accuracy (correlation cor(y,yHat))
  out$optCOR                 # SGP with maximum accuracy
  B = coef(fm1)              # Regression coefficients for all tst
  B = coef(fm1, iy=1)        # Coefficients for first tst (tst[1])
  B = coef(fm1, ilambda=10)  # Coefficients associated to the 10th lambda
  B = coef(fm1, nsup=10)     # Coefficients for which nsup=10
  
  plot(fm1)                  # Penalization vs accuracy plot
  plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity')     
  
  varU = fm1$varU
  varE = fm1$varE
  b = fm1$b
  
  #---------------------------------------------------
  # Predicting a testing set using a value of lambda
  # obtained from cross-validation in a traning set
  #---------------------------------------------------
  # Run a cross validation in training set
  fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV")
  lambda = summary(fm2)$optCOR["lambda"]

  # Fit the index with the obtained lambda
  fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda)
  summary(fm3)$accuracy        # Testing set accuracy

  # Compare the accuracy with that of the non-sparse index (G-BLUP)
  summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one
  
  #---------------------------------------------------
  # Multi-trait SGP
  #---------------------------------------------------
  ID_geno = as.vector(row(Y0))
  ID_trait = as.vector(col(Y0))
  y = as.vector(Y0)
  
  # Training and testing sets
  tst = c(which(ID_trait==1)[Y$trial %in% 2:3],
          which(ID_trait==2)[Y$trial %in% 2],
          which(ID_trait==3)[Y$trial %in% 3])
  trn = seq_along(y)[-tst]
  
  fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait,
             trn=trn, tst=tst)
  
  multitrait.plot(fmMT)
  

[Package SFSI version 1.4 Index]