5. Sparse Genomic Prediction (SGP) {SFSI} | R Documentation |
Sparse Genomic Prediction
Description
Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.
An optimal penalization can be chosen using cross-validation (CV) within a specific training set.
Usage
SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL,
trn = NULL, tst = NULL, varU = NULL, varE = NULL,
ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
alpha = 1, lambda = NULL, nlambda = 100,
lambda.min = .Machine$double.eps^0.5,
common.lambda = TRUE, subset = NULL, tol = 1E-4,
maxiter = 500, method = c("REML","ML"), tag = NULL,
save.at = NULL, precision.format = c("single","double"),
mc.cores = 1L, verbose = 2)
SGP.CV(y, X = NULL, b = NULL, Z = NULL, K,
trn = NULL, varU = NULL, varE = NULL,
ID_geno = NULL, ID_trait = NULL,
intercept = TRUE, alpha = 1, lambda = NULL,
nlambda = 100, lambda.min = .Machine$double.eps^0.5,
common.lambda = TRUE, nfolds = 5, nCV = 1L,
folds = NULL, seed = NULL, subset = NULL, tol = 1E-4,
maxiter = 500, method = c("REML","ML"), tag = NULL,
save.at = NULL, mc.cores = 1L, verbose = TRUE)
Arguments
y |
(numeric vector) Response variable. It can be a matrix with each column representing a different response variable. If it is passed to the 'SGP' function, predicted values for testing data are computed using phenotypes from training data |
X |
(numeric matrix) Design matrix for fixed effects. When |
b |
(numeric vector) Fixed effects. When |
K |
(numeric matrix) Kinship relationship matrix |
Z |
(numeric matrix) Design matrix for random effects. When |
varU , varE |
(numeric) Genetic and residual variances. When either |
ID_geno |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
intercept |
|
trn , tst |
(integer vector) Which elements from vector |
subset |
(integer vector) |
alpha |
(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties |
lambda |
(numeric vector) Penalization parameter sequence. Default is max(abs(G[trn,tst])/alpha) to a minimum equal to zero. If |
nlambda |
(integer) Number of lambdas generated when |
lambda.min |
(numeric) Minimum value of lambda in the generated grid when |
nfolds |
(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlapping folds in which the data is split into to do cross-validation. When |
seed |
(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed |
nCV |
(integer) Number of CV repetitions to be performed. Default is |
folds |
(integer matrix) A matrix with |
common.lambda |
|
mc.cores |
(integer) Number of cores used. When |
tol |
(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence |
maxiter |
(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached |
save.at |
(character) Path where files (regression coefficients and output object) are to be saved (this may include a prefix added to the files). Default |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when |
method |
(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP' |
tag |
(character) Name given to the output for tagging purposes. Default |
verbose |
(integer) If greater than zero analysis details will be printed |
Details
The basic linear mixed model that relates phenotypes with genetic values is of the form
y = X b + Z g + e
where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and environmental terms are assumed e ~ N(0,σ2eI).
The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The values utst = {ui}, i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as
ui = β'i (ytrn - Xtrnb)
where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function
-σ2uG'trn,tst(i) βi + 1/2 β'i(σ2uGtrn + σ2eI)βi + λ J(βi)
where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; Gtrn is the sub-matrix corresponding to the training set; λ is the penalization parameter; and J(βi) is a penalty function given by
1/2(1-α)||βi||22 + α||βi||1
where 0 ≤ α ≤ 1, and ||βi||1 = ∑j=1|βij| and ||βi||22 = ∑j=1βij2 are the L1 and (squared) L2-norms, respectively.
Function 'SGP' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)
) by setting the argument Sigma
equal to
σ2uGtrn + σ2eI
and Gamma
equal to
σ2uGtrn,tst(i).
Function 'SGP.CV' performs cross-validation within the training data specified in argument trn
. Training data is divided into folds and the SGP is sequentially derived for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).
Value
Function 'SGP' returns a list object of the class 'SGP' for which methods coef
, predict
, plot
, and summary
exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:
-
b
: (vector) fixed effects solutions (including the intercept). -
Xb
: (vector) total fixed values (X b). -
u
: (matrix) total genetic values (u = Z g) for testing individuals (in rows) associated to each value of lambda (in columns). -
varU
,varE
,h2
: variance components solutions. -
alpha
: value for the elastic-net weights used. -
lambda
: (matrix) sequence of values of lambda used (in columns) for each testing individual (in rows). -
nsup
: (matrix) number of non-zero predictors at each solution given by lambda for each testing individual (in rows). -
file_beta
: path where regression coefficients are saved.
Function 'SGP.CV' returns a list object of length nCV
of the class 'SGP'. Optimal cross-validated penalization values can be obtained using thesummary
method. Method plot
is also available.
Examples
require(SFSI)
data(wheatHTP)
index = which(Y$trial %in% 1:12) # Use only a subset of data
Y = Y[index,]
M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers
G = tcrossprod(M) # Genomic relationship matrix
Y0 = scale(as.matrix(Y[,4:6])) # Response variable
#---------------------------------------------------
# Single-trait model
#---------------------------------------------------
y = Y0[,1]
# Training and testing sets
tst = which(Y$trial %in% 2:3)
trn = seq_along(y)[-tst]
# Sparse genomic prediction
fm1 = SGP(y, K=G, trn=trn, tst=tst)
uHat = predict(fm1) # Predicted values for each testing element
out = summary(fm1) # Useful function to get results
corTST = out$accuracy # Testing set accuracy (correlation cor(y,yHat))
out$optCOR # SGP with maximum accuracy
B = coef(fm1) # Regression coefficients for all tst
B = coef(fm1, iy=1) # Coefficients for first tst (tst[1])
B = coef(fm1, ilambda=10) # Coefficients associated to the 10th lambda
B = coef(fm1, nsup=10) # Coefficients for which nsup=10
plot(fm1) # Penalization vs accuracy plot
plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity')
varU = fm1$varU
varE = fm1$varE
b = fm1$b
#---------------------------------------------------
# Predicting a testing set using a value of lambda
# obtained from cross-validation in a traning set
#---------------------------------------------------
# Run a cross validation in training set
fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV")
lambda = summary(fm2)$optCOR["lambda"]
# Fit the index with the obtained lambda
fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda)
summary(fm3)$accuracy # Testing set accuracy
# Compare the accuracy with that of the non-sparse index (G-BLUP)
summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one
#---------------------------------------------------
# Multi-trait SGP
#---------------------------------------------------
ID_geno = as.vector(row(Y0))
ID_trait = as.vector(col(Y0))
y = as.vector(Y0)
# Training and testing sets
tst = c(which(ID_trait==1)[Y$trial %in% 2:3],
which(ID_trait==2)[Y$trial %in% 2],
which(ID_trait==3)[Y$trial %in% 3])
trn = seq_along(y)[-tst]
fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait,
trn=trn, tst=tst)
multitrait.plot(fmMT)