R: Fitting a Linear Mixed model to calculate BLUP

6. BLUP estimation from Linear Mixed Model {SFSI}

R Documentation

Fitting a Linear Mixed model to calculate BLUP

Description

Solves the Linear Mixed Model and calculates the Best Linear Unbiased Predictor (BLUP)

Usage

fitBLUP(y, X = NULL, Z = NULL, K = NULL, trn = NULL,
        EVD = NULL, varU = NULL, varE = NULL,
        ID_geno = NULL, ID_trait = NULL, intercept = TRUE,
        BLUP = TRUE, method = c("REML","ML"),
        interval = c(1E-9,1E9), tol = 1E-8, maxiter = 1000,
        n.regions = 10, verbose = TRUE)

Arguments

`y`	(numeric vector) Response variable. It can be a matrix with each column representing a different response variable
`X`	(numeric matrix) Design matrix for fixed effects. When `X=NULL` a vector of ones is constructed only for the intercept (default)
`Z`	(numeric matrix) Design matrix for random effects. When `Z=NULL` an identity matrix is considered (default) thus G = K; otherwise G = Z K Z' is used
`K`	(numeric matrix) Kinship relationship matrix
`trn`	(integer vector) Which elements from vector `y` are to be used for training. When `trn = NULL`, non-NA entries in vector `y` will be used as training set
`EVD`	(list) Eigenvectors and eigenvalues from eigenvalue decomposition (EVD) of G corresponding to training data
`ID_geno`	(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector `y` to rows/columns of matrix `G`
`ID_trait`	(character/integer) For within-trait analysis only, vector with either names or indices mapping entries of the vector `y` to different traits
`varU`, `varE`	(numeric) Genetic and residual variances. When both `varU` and `varE` are not `NULL` they are not calculated; otherwise, the likelihood function (REML or ML) is optimized to search for the genetic/residual variances ratio
`intercept`	`TRUE` or `FALSE` to whether fit an intercept. When `FALSE`, the model assumes a null intercept
`BLUP`	`TRUE` or `FALSE` to whether return the random effects estimates
`method`	(character) Either 'REML' (Restricted Maximum Likelihood) or 'ML' (Maximum Likelihood)
`tol`	(numeric) Maximum error between two consecutive solutions (convergence tolerance) when finding the root of the log-likelihood's first derivative
`maxiter`	(integer) Maximum number of iterations to run before convergence is reached
`interval`	(numeric vector) Range of values in which the root is searched
`n.regions`	(numeric) Number of regions in which the searched 'interval' is divided for local optimization
`verbose`	`TRUE` or `FALSE` to whether show progress

Details

The basic linear mixed model that relates phenotypes with genetic values is of the form

y = X b + Z g + e

where y is a vector with the response, b is the vector of fixed effects, u is the vector of the (random) genetic effects of the genotypes, e is the vector of environmental residuals (random error), and X and Z are design matrices for the fixed and genetic effects, respectively. Genetic effects are assumed to follow a Normal distribution as g ~ N(0,σ²_uK), and the error terms are assumed e ~ N(0,σ²_eI).

The vector of total genetic values u = Z g will therefore follow u ~ N(0,σ²_uG) where G = Z K Z'. In the un-replicated case, Z = I is an identity matrix, and hence u = g and G = K.

The predicted values u_trn = {u_i}, i = 1,2,...,n_trn, corresponding to observed data (training set) are derived as

u_trn = B (y_trn - X_trnb)

where B is a matrix of weights given by

B = σ²_uG_trn (σ²_uG_trn + σ²_eI)^-1

where G_trn is the sub-matrix corresponding to the training set. This matrix can be rewritten as

B = G_trn (G_trn + θI)^-1

where θ = σ²_e/σ²_u is the residual/genetic variances ratio representing a ridge-like shrinkage parameter.

The matrix H = G_trn + θI in the above equation can be used to obtain predictions corresponding to un-observed data (testing set), u_tst = {u_i}, i = 1,2,...,n_tst, by

B = G_tst,trn (G_trn + θI)^-1

where G_tst,trn is the sub-matrix of G corresponding to the testing set (in rows) and training set (in columns).

Solutions are found using the GEMMA (Genome-wide Efficient Mixed Model Analysis) approach (Zhou & Stephens, 2012). First, the Brent's method is implemented to solve for the genetic/residual variances ratio (i.e., 1/θ) from the first derivative of the log-likelihood (either REML or ML). Then, variances σ²_u and σ²_e are calculated. Finally, b is obtained using Generalized Least Squares.

Value

Returns a list object that contains the elements:

b: (vector) fixed effects solutions (including the intercept).
u: (vector) total genetic values (u = Z g).
g: (vector) genetic effects solutions.
varU: random effect variance.
varE: residual variance.
h2: heritability.
convergence: (logical) whether Brent's method converged.
method: either 'REML' or 'ML' method used.

References

Zhou X, Stephens M (2012). Genome-wide efficient mixed-model analysis for association studies. Nature Genetics, 44(7), 821-824

Examples

  require(SFSI)
  data(wheatHTP)
  
  index = which(Y$trial %in% 1:20)    # Use only a subset of data
  Y = Y[index,]
  M = scale(M[index,])/sqrt(ncol(M))  # Subset and scale markers
  G = tcrossprod(M)                   # Genomic relationship matrix
  Y0 = scale(as.matrix(Y[,4:6]))      # Response variable
  
  #---------------------------------------------------
  # Single-trait model
  #---------------------------------------------------
  y = Y0[,1]     
  
  # Training and testing sets
  tst = which(Y$trial %in% 1:3)
  trn = seq_along(y)[-tst]
  
  # Kinship-based model
  fm1 = fitBLUP(y, K=G, trn=trn)
  
  head(fm1$g)                  # Genetic effects
  plot(y[tst],fm1$yHat[tst])   # Predicted vs observed values in testing set
  cor(y[tst],fm1$yHat[tst])    # Prediction accuracy in testing set
  cor(y[trn],fm1$yHat[trn])    # Prediction accuracy in training set
  fm1$varU                     # Genetic variance
  fm1$varE                     # Residual variance
  fm1$h2                       # Heritability
  fm1$b                        # Fixed effects
  
  # Markers-based model
  fm2 = fitBLUP(y, Z=M, trn=trn)
  head(fm2$g)                   # Marker effects
  all.equal(fm1$yHat, fm2$yHat)
  fm2$varU                      # Genetic variance
  fm2$varU*sum(apply(M,2,var))
  
  #---------------------------------------------------
  # Multiple response variables
  #---------------------------------------------------
  ID_geno = as.vector(row(Y0))
  ID_trait = as.vector(col(Y0))
  y = as.vector(Y0)
  
  # Training and testing sets
  tst = c(which(ID_trait==1)[Y$trial %in% 1:3],
          which(ID_trait==2)[Y$trial %in% 1:3],
          which(ID_trait==3)[Y$trial %in% 1:3])
  trn = seq_along(y)[-tst]
  
  # Across traits model
  fm3 = fitBLUP(y, K=G, ID_geno=ID_geno, trn=trn)
  plot(fm1$yHat,fm3$yHat[ID_trait==1])  # different from the single-trait model
  
  # Within traits model: pass an index for traits
  fm4 = fitBLUP(y, K=G, ID_geno=ID_geno, ID_trait=ID_trait, trn=trn)
  plot(fm1$yHat,fm4$yHat[ID_trait==1])  # equal to the single-trait model

[Package SFSI version 1.4 Index]