R: Estimate genome-wide prediction accuracy using...

x.val {PopVar}

R Documentation

Estimate genome-wide prediction accuracy using cross-validation

Description

x.val performs cross-validation (CV) to estimate the accuracy of genome-wide prediction (otherwise known as genomic selection) for a specific training population (TP), i.e. a set of individuals for which phenotypic and genotypic data are available. Cross-validation can be conducted via one of two methods within x.val; see Details for more information.

NOTE - x.val, specifically BGLR, writes and reads files to disk, so it is highly recommended to set your working directory
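As a minimal sketch of the note above, the following base R snippet prepares a dedicated scratch directory before a run. The directory name `xval_run` and the prefix `cv_` are hypothetical choices for illustration, not names used by PopVar or BGLR.

```r
# Hypothetical run directory under the session's temporary directory
run.dir <- file.path(tempdir(), "xval_run")
dir.create(run.dir, showWarnings = FALSE, recursive = TRUE)

# saveAt is documented as a path plus a file-name prefix;
# "cv_" is an arbitrary prefix chosen for this sketch
save.prefix <- file.path(run.dir, "cv_")

# Switch the working directory for the run and restore it afterwards
old.wd <- setwd(run.dir)
# ... call x.val(..., saveAt = save.prefix) here ...
setwd(old.wd)
```

Keeping the files in one directory makes it easy to inspect or delete them after the run.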

Usage

x.val(
  G.in = NULL,
  y.in = NULL,
  min.maf = 0.01,
  mkr.cutoff = 0.5,
  entry.cutoff = 0.5,
  remove.dups = TRUE,
  impute = "EM",
  frac.train = 0.6,
  nCV.iter = 100,
  nFold = NULL,
  nFold.reps = 1,
  return.estimates = FALSE,
  CV.burnIn = 750,
  CV.nIter = 1500,
  models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"),
  saveAt = tempdir()
)

Arguments

`G.in`	`Matrix` of genotypic data. First row contains marker names and the first column contains entry (taxa) names. Genotypes should be coded as follows: `1`: homozygous for minor allele; `0`: heterozygous; `-1`: homozygous for major allele; `NA`: missing data. Imputed genotypes can be passed; see `impute` below for details. TIP - Set header=`FALSE` within `read.table` or `read.csv` when importing a tab-delimited file containing data for `G.in`.
`y.in`	`Matrix` of phenotypic data. First column contains entry (taxa) names found in `G.in`, regardless of whether the entry has a phenotype for any or all traits. Additional columns contain phenotypic data; column names should reflect the trait name(s). TIP - Set header=`TRUE` within `read.table` or `read.csv` when importing a tab-delimited file containing data for `y.in`.
`min.maf`	Optional `numeric` indicating a minimum minor allele frequency (MAF) when filtering `G.in`. Markers with an MAF < `min.maf` will be removed. Default is `0.01` to remove monomorphic markers. Set to `0` for no filtering.
`mkr.cutoff`	Optional `numeric` indicating the maximum missing data per marker when filtering `G.in`. Markers missing > `mkr.cutoff` data will be removed. Default is `0.50`. Set to `1` for no filtering.
`entry.cutoff`	Optional `numeric` indicating the maximum missing genotypic data per entry allowed when filtering `G.in`. Entries missing > `entry.cutoff` marker data will be removed. Default is `0.50`. Set to `1` for no filtering.
`remove.dups`	Optional `logical`. If `TRUE`, duplicate entries in the genotype matrix, if present, will be removed. This step may be necessary for missing marker imputation (see `impute`). Default is `TRUE`.
`impute`	Options include `c("EM", "mean", "pass")`. By default (i.e. `"EM"`), after filtering, missing genotypic data will be imputed via the EM algorithm implemented in the `rrBLUP` package (Endelman, 2011; Poland et al., 2012). If `"mean"`, missing genotypic data will be imputed via the 'marker mean' method, also implemented in the `rrBLUP` package. Enter `"pass"` if a pre-filtered and imputed genotype matrix is provided to `G.in`.
`frac.train`	Optional `numeric` indicating the fraction of the TP that is used to estimate marker effects (i.e. the training set) under cross-validation (CV) method 1 (see `Details`). The remaining `(1 - frac.train)` of the TP will then comprise the validation set. Default is `0.6`.
`nCV.iter`	Optional `integer` indicating the number of times to iterate CV method 1 described in `Details`. Default is `100`.
`nFold`	Optional `integer`. If a number is provided, denoting the number of "folds", then CV will be conducted using CV method 2 (see `Details`). Default is `NULL`, resulting in the use of CV method 1.
`nFold.reps`	Optional `integer` indicating the number of times CV method 2 is repeated. The CV accuracy returned is the average r across all reps. Default is `1`.
`return.estimates`	Optional `logical`. If `TRUE`, additional items including the marker effect and beta estimates from the selected prediction model (i.e. highest CV accuracy) will be returned. Default is `FALSE`.
`CV.burnIn`	Optional `integer` argument used by `BGLR` when fitting Bayesian models. Default is `750`.
`CV.nIter`	Optional `integer` argument used by `BGLR` (de los Campos and Rodriguez, 2014) when fitting Bayesian models. Default is `1500`.
`models`	Optional `character vector` of the regression models to be used in CV and to estimate marker effects. Options include `rrBLUP, BayesA, BayesB, BayesC, BL, BRR`, one or more may be included at a time. By default all models are tested.
`saveAt`	When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files that are produced by the `BGLR` function. Default is `tempdir()`.
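To make the `G.in` layout and the three filtering thresholds concrete, here is a small base R sketch on a toy matrix. It is an illustration under stated assumptions, not PopVar's internal code: the toy data, the top-left cell label, the helper variable names, and the choice to compute all filters on the raw matrix are assumptions, and the order in which x.val applies its filters may differ.

```r
# Toy G.in in the documented layout: first row = marker names,
# first column = entry names. The top-left label is a placeholder.
G.toy <- rbind(
  c("entry", "M1", "M2", "M3", "M4", "M5"),
  c("E1",  1, -1, NA,  0, -1),
  c("E2", -1, -1, NA,  1, -1),
  c("E3",  0, -1, NA, -1,  0),
  c("E4", -1, -1, NA, -1, -1),
  c("E5",  1, -1,  1, NA, -1),
  c("E6", -1, -1, -1,  0, -1),
  c("E7", NA, NA, NA,  0, NA)
)

# Extract the numeric genotype block (coded 1 / 0 / -1 / NA)
geno <- matrix(as.numeric(G.toy[-1, -1]),
               nrow = nrow(G.toy) - 1,
               dimnames = list(G.toy[-1, 1], G.toy[1, -1]))

# Fraction of missing calls per marker and per entry
mkr.missing   <- colMeans(is.na(geno))
entry.missing <- rowMeans(is.na(geno))

# Frequency of the allele coded 1, then minor allele frequency
p   <- (colMeans(geno, na.rm = TRUE) + 1) / 2
maf <- pmin(p, 1 - p)

# Apply the default thresholds from the Usage section
keep.mkr   <- mkr.missing <= 0.5 & maf >= 0.01
keep.entry <- entry.missing <= 0.5
geno.filt  <- geno[keep.entry, keep.mkr, drop = FALSE]

colnames(geno.filt)  # M1, M4, M5 (M2 is monomorphic, M3 is mostly missing)
rownames(geno.filt)  # E1 to E6 (E7 is missing 4 of 5 markers)
```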

Details

Two CV methods are available within PopVar:

CV method 1: During each iteration a training (i.e. model training) set of size N*(frac.train), where N is the size of the TP, is randomly sampled from the TP, and the remainder of the TP is assigned to the validation set. The accuracy of each model is expressed as the average Pearson correlation coefficient (r) between the genomic estimated breeding values (GEBVs) and observed phenotypic values in the validation set across all nCV.iter iterations. Due to its amenability to various TP sizes, CV method 1 is the default CV method in pop.predict.
CV method 2: nFold independent validation sets are sampled from the TP and predicted by the remainder. For example, if nFold = 10 the TP will be split into 10 equal sets, each containing 1/10 of the TP, which will be predicted by the remaining 9/10 of the TP. The accuracy of each model is expressed as the average r between the GEBVs and observed phenotypic values in the validation set across all nFold folds. The process can be repeated nFold.reps times, with nFold new independent sets being sampled in each replication, in which case the reported prediction accuracies are averages across all folds and replications.
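The two sampling schemes can be sketched in base R as follows. This is only an illustration of how the sets are drawn and how r is averaged: the simulated data, the function names `predict_ridge`, `cv_method1` and `cv_method2`, and the fixed-penalty ridge regression used as a stand-in predictor are all assumptions, and none of this is PopVar's own rrBLUP or BGLR code.

```r
set.seed(42)

# Simulated TP (assumption): N entries, M markers coded -1 / 0 / 1
N <- 100; M <- 50
X <- matrix(sample(c(-1, 0, 1), N * M, replace = TRUE), N, M)
y <- as.vector(X %*% rnorm(M, sd = 0.3) + rnorm(N))

# Stand-in predictor (assumption): ridge regression with a fixed penalty
predict_ridge <- function(X.tr, y.tr, X.val, lambda = 10) {
  mu   <- mean(y.tr)
  beta <- solve(crossprod(X.tr) + diag(lambda, ncol(X.tr)),
                crossprod(X.tr, y.tr - mu))
  as.vector(mu + X.val %*% beta)
}

# CV method 1: repeated random training/validation splits
cv_method1 <- function(X, y, frac.train = 0.6, nCV.iter = 100) {
  N <- nrow(X)
  r <- replicate(nCV.iter, {
    tr <- sample(N, round(N * frac.train))
    cor(predict_ridge(X[tr, , drop = FALSE], y[tr],
                      X[-tr, , drop = FALSE]), y[-tr])
  })
  mean(r)
}

# CV method 2: nFold folds, repeated nFold.reps times
cv_method2 <- function(X, y, nFold = 10, nFold.reps = 1) {
  N <- nrow(X)
  r <- replicate(nFold.reps, {
    fold <- sample(rep(seq_len(nFold), length.out = N))
    mean(sapply(seq_len(nFold), function(k) {
      val <- which(fold == k)
      cor(predict_ridge(X[-val, , drop = FALSE], y[-val],
                        X[val, , drop = FALSE]), y[val])
    }))
  })
  mean(r)
}

r1 <- cv_method1(X, y, frac.train = 0.6, nCV.iter = 25)
r2 <- cv_method2(X, y, nFold = 5, nFold.reps = 3)
```

In method 1 the validation sets of different iterations can overlap, whereas in method 2 every entry is predicted exactly once per replication.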

Value

A list containing:

CVs A data frame of CV results for each trait/model combination specified
If return.estimates is TRUE, the following additional items will be returned:
- models.used A list of the models chosen to estimate marker effects for each trait
- mkr.effects A vector of marker effect estimates for each trait generated by the respective prediction model used
- betas A list of beta values for each trait generated by the respective prediction model used

Examples


## CV using method 1 with 25 iterations
CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25)
CV.mthd1$CVs

## CV using method 2 with 5 folds and 3 replications
x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3)

[Package PopVar version 1.3.1 Index]