5. Sparse Genomic Prediction (SGP) {SFSI} | R Documentation |
Sparse Genomic Prediction
Description
Computes the entire Elastic-Net solution for the regression coefficients of a Selection Index for a grid of values of the penalization parameter.
An optimal penalization can be chosen using cross-validation (CV) within a specific training set.
Usage
SGP(y = NULL, X = NULL, b = NULL, Z = NULL, K = NULL,
trn = NULL, tst = NULL, varU = NULL, varE = NULL,
ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
alpha = 1, lambda = NULL, nlambda = 100,
lambda.min = .Machine$double.eps^0.5,
common.lambda = TRUE, subset = NULL, tol = 1E-4,
maxiter = 500, method = c("REML","ML"), tag = NULL,
save.at = NULL, precision.format = c("single","double"),
mc.cores = 1L, verbose = 2)
SGP.CV(y, X = NULL, b = NULL, Z = NULL, K,
trn = NULL, varU = NULL, varE = NULL,
ID_geno = NULL, ID_trait = NULL,
intercept = TRUE, alpha = 1, lambda = NULL,
nlambda = 100, lambda.min = .Machine$double.eps^0.5,
common.lambda = TRUE, nfolds = 5, nCV = 1L,
folds = NULL, seed = NULL, subset = NULL, tol = 1E-4,
maxiter = 500, method = c("REML","ML"), tag = NULL,
save.at = NULL, mc.cores = 1L, verbose = TRUE)
Arguments
y |
(numeric vector) Response variable. It can be a matrix with each column representing a different response variable. If it is passed to the 'SGP' function, predicted values for testing data are computed using phenotypes from training data |
X |
(numeric matrix) Design matrix for fixed effects. When |
b |
(numeric vector) Fixed effects. When |
K |
(numeric matrix) Kinship relationship matrix |
Z |
(numeric matrix) Design matrix for random effects. When |
varU , varE |
(numeric) Genetic and residual variances. When either |
ID_geno |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
ID_trait |
(character/integer) For multi-trait analysis only, vector with either names or indices mapping entries of the vector |
intercept |
|
trn , tst |
(integer vector) Which elements from vector |
subset |
(integer vector) |
alpha |
(numeric) Value between 0 and 1 for the weights given to the L1 and L2-penalties |
lambda |
(numeric vector) Penalization parameter sequence. Default is max(abs(G[trn,tst])/alpha) to a minimum equal to zero. If |
nlambda |
(integer) Number of lambdas generated when |
lambda.min |
(numeric) Minimum value of lambda in the generated grid when |
nfolds |
(integer/character) Either 2,3,5,10 or 'n' indicating the number of non-overlapping folds in which the data is split into to do cross-validation. When |
seed |
(numeric vector) Seed to fix randomization when creating folds for cross-validation. If it is a vector, a number equal to its length of CV repetitions are performed |
nCV |
(integer) Number of CV repetitions to be performed. Default is |
folds |
(integer matrix) A matrix with |
common.lambda |
|
mc.cores |
(integer) Number of cores used. When |
tol |
(numeric) Maximum error between two consecutive solutions of the CD algorithm to declare convergence |
maxiter |
(integer) Maximum number of iterations to run the CD algorithm at each lambda step before convergence is reached |
save.at |
(character) Path where files (regression coefficients and output object) are to be saved (this may include a prefix added to the files). Default |
precision.format |
(character) Either 'single' or 'double' for numeric precision and memory occupancy (4 or 8 bytes, respectively) of the regression coefficients. This is only used when |
method |
(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood) to calculate variance components as per the function 'fitBLUP' |
tag |
(character) Name given to the output for tagging purposes. Default |
verbose |
(integer) If greater than zero analysis details will be printed |
Details
The basic linear mixed model that relates phenotypes with genetic values is of the form
y = X b + Z g + e
where y is a vector with the response, b is the vector of fixed effects, g is the vector of the genetic values of the genotypes, e is the vector of environmental residuals, and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ2uK), and environmental terms are assumed e ~ N(0,σ2eI).
The resulting vector of genetic values u = Z g will therefore follow u ~ N(0,σ2uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.
The values utst = {ui}, i = 1,2,...,ntst, for a testing set are estimated individual-wise using (as predictors) all available observations in a training set as
ui = β'i (ytrn - Xtrnb)
where βi is a vector of weights that are found separately for each individual in the testing set, by minimizing the penalized mean squared error function
-σ2uG'trn,tst(i) βi + 1/2 β'i(σ2uGtrn + σ2eI)βi + λ J(βi)
where Gtrn,tst(i) is the ith column of the sub-matrix of G whose rows correspond to the training set and columns to the testing set; Gtrn is the sub-matrix corresponding to the training set; λ is the penalization parameter; and J(βi) is a penalty function given by
1/2(1-α)||βi||22 + α||βi||1
where 0 ≤ α ≤ 1, and ||βi||1 = ∑j=1|βij| and ||βi||22 = ∑j=1βij2 are the L1 and (squared) L2-norms, respectively.
Function 'SGP' calculates each individual solution using the function 'solveEN' (via the Coordinate Descent algorithm, see help(solveEN)
) by setting the argument Sigma
equal to
σ2uGtrn + σ2eI
and Gamma
equal to
σ2uGtrn,tst(i).
Function 'SGP.CV' performs cross-validation within the training data specified in argument trn
. Training data is divided into k
folds and the SGP is sequentially derived for (all individuals in) one fold (as testing set) using information from the remaining folds (as training set).
Value
Function 'SGP' returns a list object of the class 'SGP' for which methods coef
, predict
, plot
, and summary
exist. Functions 'net.plot' and 'path.plot' can be also used. It contains the elements:
-
b
: (vector) fixed effects solutions (including the intercept). -
Xb
: (vector) total fixed values (X b). -
u
: (matrix) total genetic values (u = Z g) for testing individuals (in rows) associated to each value of lambda (in columns). -
varU
,varE
,h2
: variance components solutions. -
alpha
: value for the elastic-net weights used. -
lambda
: (matrix) sequence of values of lambda used (in columns) for each testing individual (in rows). -
nsup
: (matrix) number of non-zero predictors at each solution given by lambda for each testing individual (in rows). -
file_beta
: path where regression coefficients are saved.
Function 'SGP.CV' returns a list object of length nCV
of the class 'SGP'. Optimal cross-validated penalization values can be obtained using thesummary
method. Method plot
is also available.
Examples
require(SFSI)
data(wheatHTP)
index = which(Y$trial %in% 1:12) # Use only a subset of data
Y = Y[index,]
M = scale(M[index,])/sqrt(ncol(M)) # Subset and scale markers
G = tcrossprod(M) # Genomic relationship matrix
Y0 = scale(as.matrix(Y[,4:6])) # Response variable
#---------------------------------------------------
# Single-trait model
#---------------------------------------------------
y = Y0[,1]
# Training and testing sets
tst = which(Y$trial %in% 2:3)
trn = seq_along(y)[-tst]
# Sparse genomic prediction
fm1 = SGP(y, K=G, trn=trn, tst=tst)
uHat = predict(fm1) # Predicted values for each testing element
out = summary(fm1) # Useful function to get results
corTST = out$accuracy # Testing set accuracy (correlation cor(y,yHat))
out$optCOR # SGP with maximum accuracy
B = coef(fm1) # Regression coefficients for all tst
B = coef(fm1, iy=1) # Coefficients for first tst (tst[1])
B = coef(fm1, ilambda=10) # Coefficients associated to the 10th lambda
B = coef(fm1, nsup=10) # Coefficients for which nsup=10
plot(fm1) # Penalization vs accuracy plot
plot(fm1, y.stat="MSE", ylab='Mean Square Error', xlab='Sparsity')
varU = fm1$varU
varE = fm1$varE
b = fm1$b
#---------------------------------------------------
# Predicting a testing set using a value of lambda
# obtained from cross-validation in a traning set
#---------------------------------------------------
# Run a cross validation in training set
fm2 = SGP.CV(y, K=G, varU=varU, varE=varE, b=b, trn=trn, nfolds=5, tag="1 5CV")
lambda = summary(fm2)$optCOR["lambda"]
# Fit the index with the obtained lambda
fm3 = SGP(y, K=G, varU=varU, varE=varE, b=b, trn=trn, tst=tst, lambda=lambda)
summary(fm3)$accuracy # Testing set accuracy
# Compare the accuracy with that of the non-sparse index (G-BLUP)
summary(fm1)$accuracy[fm1$nlambda,1] # we take the last one
#---------------------------------------------------
# Multi-trait SGP
#---------------------------------------------------
ID_geno = as.vector(row(Y0))
ID_trait = as.vector(col(Y0))
y = as.vector(Y0)
# Training and testing sets
tst = c(which(ID_trait==1)[Y$trial %in% 2:3],
which(ID_trait==2)[Y$trial %in% 2],
which(ID_trait==3)[Y$trial %in% 3])
trn = seq_along(y)[-tst]
fmMT = SGP(y=y, K=G, ID_geno=ID_geno, ID_trait=ID_trait,
trn=trn, tst=tst)
multitrait.plot(fmMT)