x.val {PopVar}    R Documentation

Estimate genome-wide prediction accuracy using cross-validation

Description

x.val performs cross-validation (CV) to estimate the accuracy of genome-wide prediction (otherwise known as genomic selection) for a specific training population (TP), i.e. a set of individuals for which phenotypic and genotypic data are available. Cross-validation can be conducted via one of two methods within x.val; see Details for more information.

NOTE - x.val, specifically BGLR, writes and reads files to disk, so it is highly recommended to set your working directory.

Usage

x.val(
  G.in = NULL,
  y.in = NULL,
  min.maf = 0.01,
  mkr.cutoff = 0.5,
  entry.cutoff = 0.5,
  remove.dups = TRUE,
  impute = "EM",
  frac.train = 0.6,
  nCV.iter = 100,
  nFold = NULL,
  nFold.reps = 1,
  return.estimates = FALSE,
  CV.burnIn = 750,
  CV.nIter = 1500,
  models = c("rrBLUP", "BayesA", "BayesB", "BayesC", "BL", "BRR"),
  saveAt = tempdir()
)

Arguments

G.in

Matrix of genotypic data. First row contains marker names and the first column contains entry (taxa) names. Genotypes should be coded as follows:

  • 1: homozygous for minor allele

  • 0: heterozygous

  • -1: homozygous for major allele

  • NA: missing data

  • Imputed genotypes can be passed, see impute below for details

TIP - Set header=FALSE within read.table or read.csv when importing a tab-delimited file containing data for G.in.
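As a hedged illustration of the layout described above, the following base R sketch builds a small toy object with marker names in the first row and entry names in the first column. The marker names, entry names, and the object name G.toy are made up for illustration and are not part of PopVar.

```r
# Toy genotype calls for 3 entries x 4 markers, coded -1 / 0 / 1 with NA for missing.
# All names below are hypothetical.
calls <- matrix(c( 1,  0, -1, NA,
                  -1, -1,  1,  0,
                   0,  1,  1, -1),
                nrow = 3, byrow = TRUE)

# Put marker names in the first row and entry names in the first column.
# Mixing names and calls turns the whole object into a character matrix,
# much like a file read with header = FALSE.
G.toy <- rbind(c("entry", paste0("mkr", 1:4)),
               cbind(paste0("line", 1:3), calls))

dim(G.toy)  # rows = header + entries, columns = names + markers
```
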

y.in

Matrix of phenotypic data. First column contains entry (taxa) names found in G.in, regardless of whether the entry has a phenotype for any or all traits. Additional columns contain phenotypic data; column names should reflect the trait name(s).

TIP - Set header=TRUE within read.table or read.csv when importing a tab-delimited file containing data for y.in.
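A minimal sketch of importing a phenotype file with trait names in the header row. The file contents, trait names, and the object name y.toy are invented for illustration; note that read.table returns a data frame, and whether further conversion is needed before calling x.val is not shown here.

```r
# Write a tiny tab-delimited phenotype file to a temporary location (toy values only).
pheno.file <- tempfile(fileext = ".txt")
writeLines(c("entry\tyield\theight",
             "line1\t5.2\t90",
             "line2\tNA\t85",
             "line3\t4.8\tNA"),
           pheno.file)

# header = TRUE keeps the trait names as column names;
# entries without a phenotype for a trait are read as NA.
y.toy <- read.table(pheno.file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)
colnames(y.toy)
```
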

min.maf

Optional numeric indicating a minimum minor allele frequency (MAF) when filtering G.in. Markers with an MAF < min.maf will be removed. Default is 0.01 to remove monomorphic markers. Set to 0 for no filtering.
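To make the MAF filter concrete, here is a base R sketch for a numeric matrix in the -1/0/1 coding. It is a generic calculation under the stated coding, not necessarily the code PopVar runs internally; the object names are hypothetical.

```r
# Toy numeric genotype matrix (entries in rows, markers in columns).
geno <- matrix(c( 1,  1,  1, -1,
                  1,  0, -1, -1,
                  1, NA,  1, -1,
                  1, -1,  0,  1),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("line", 1:4), paste0("mkr", 1:4)))

# Frequency of the allele coded as 1: shift {-1, 0, 1} to {0, 1, 2} allele
# counts, average over non-missing calls, and divide by 2.
p <- colMeans(geno + 1, na.rm = TRUE) / 2

# The minor allele is whichever allele is rarer at each marker.
maf <- pmin(p, 1 - p)

# Keep markers whose MAF is at least min.maf; mkr1 is monomorphic and is dropped.
min.maf <- 0.01
geno.filtered <- geno[, maf >= min.maf, drop = FALSE]
colnames(geno.filtered)
```
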

mkr.cutoff

Optional numeric indicating the maximum missing data per marker when filtering G.in. Markers missing > mkr.cutoff data will be removed. Default is 0.50. Set to 1 for no filtering.

entry.cutoff

Optional numeric indicating the maximum missing genotypic data per entry allowed when filtering G.in. Entries missing > entry.cutoff marker data will be removed. Default is 0.50. Set to 1 for no filtering.
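The two missing-data cutoffs can be sketched in base R as follows. Filtering markers first and entries second is an assumption made for this example; the order used inside x.val is not stated here, and the object names are hypothetical.

```r
# Toy numeric genotype matrix with missing calls (entries in rows, markers in columns).
geno <- matrix(c( 1, NA,  0,  1,
                 NA, NA, NA,  0,
                 -1, NA,  1,  1,
                  0,  1, -1, NA),
               nrow = 4, byrow = TRUE,
               dimnames = list(paste0("line", 1:4), paste0("mkr", 1:4)))

mkr.cutoff <- 0.5
entry.cutoff <- 0.5

# Step 1 (assumed order): drop markers missing more than mkr.cutoff of their calls.
mkr.missing <- colMeans(is.na(geno))
geno.mkr <- geno[, mkr.missing <= mkr.cutoff, drop = FALSE]

# Step 2: drop entries missing more than entry.cutoff of the remaining markers.
entry.missing <- rowMeans(is.na(geno.mkr))
geno.clean <- geno.mkr[entry.missing <= entry.cutoff, , drop = FALSE]

dimnames(geno.clean)
```
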

remove.dups

Optional logical. If TRUE, duplicate entries in the genotype matrix, if present, will be removed. This step may be necessary for missing marker imputation (see impute). Default is TRUE.

impute

Options include c("EM", "mean", "pass"). By default (i.e. "EM"), after filtering, missing genotypic data will be imputed via the EM algorithm implemented in rrBLUP-package (Endelman, 2011; Poland et al., 2012). If "mean", missing genotypic data will be imputed via the 'marker mean' method, also implemented in rrBLUP-package. Enter "pass" if a pre-filtered and imputed genotype matrix is provided to G.in.
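For intuition, the 'marker mean' idea can be sketched in a few lines of base R: each missing call is replaced by the mean of the observed calls at that marker. This is a generic illustration with hypothetical object names, not the rrBLUP implementation, and the EM option is not shown.

```r
# Toy numeric genotype matrix with two missing calls.
geno <- matrix(c( 1, NA, -1,
                  0,  1, -1,
                 NA,  1,  1,
                 -1,  1, -1),
               nrow = 4, byrow = TRUE)

# Mean of the observed calls at each marker (column).
mkr.means <- colMeans(geno, na.rm = TRUE)

# Locate missing cells as (row, col) pairs and fill each with its marker mean.
missing <- which(is.na(geno), arr.ind = TRUE)
geno.imp <- geno
geno.imp[missing] <- mkr.means[missing[, "col"]]

anyNA(geno.imp)  # no missing calls remain after imputation
```

Imputed values are generally fractional, so a matrix imputed this way is no longer restricted to -1, 0, and 1.
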

frac.train

Optional numeric indicating the fraction of the TP that is used to estimate marker effects (i.e. the training set) under cross-validation (CV) method 1 (see Details). The remaining (1 - frac.train) of the TP will then comprise the prediction set. Default is 0.6.

nCV.iter

Optional integer indicating the number of times to iterate CV method 1 described in Details. Default is 100.
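The interplay of frac.train and nCV.iter can be sketched as a repeated random split. Everything below is an illustrative assumption: the data are simulated, a fixed-penalty ridge regression stands in for the prediction models, and averaging the correlation across iterations is one plausible summary rather than a statement of what x.val returns.

```r
set.seed(1)

# Simulated toy data: 50 entries, 20 markers coded -1 / 0 / 1.
n <- 50
m <- 20
geno <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), nrow = n)
y <- as.vector(geno %*% rnorm(m) + rnorm(n))

frac.train <- 0.6
nCV.iter <- 25
lambda <- 1  # arbitrary ridge penalty; a stand-in, not a value estimated by rrBLUP

r <- numeric(nCV.iter)
for (i in seq_len(nCV.iter)) {
  # Randomly sample the training subset; the rest is held out for prediction.
  train <- sample(n, size = round(frac.train * n))
  valid <- setdiff(seq_len(n), train)

  # Estimate marker effects on the training subset only.
  X <- geno[train, , drop = FALSE]
  mu <- mean(y[train])
  u <- solve(crossprod(X) + lambda * diag(m), crossprod(X, y[train] - mu))

  # Predict the held-out entries and correlate with their observed values.
  pred <- as.vector(mu + geno[valid, , drop = FALSE] %*% u)
  r[i] <- cor(pred, y[valid])
}

mean(r)  # average correlation over the iterations
```
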

nFold

Optional integer. If a number is provided, denoting the number of "folds", then CV will be conducted using CV method 2 (see Details). Default is NULL, resulting in the use of CV method 1.

nFold.reps

Optional integer indicating the number of times CV method 2 is repeated. The CV accuracy returned is the average r of each rep. Default is 1.
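A hedged sketch of how nFold and nFold.reps could translate into fold assignments, assuming conventional k-fold partitioning in which every entry is held out exactly once per replication. The object names are hypothetical and the actual assignment logic in x.val may differ.

```r
set.seed(1)

n <- 23          # number of entries in a toy TP
nFold <- 5
nFold.reps <- 3

# One column per replication: each entry receives a fold label from 1 to nFold,
# with fold sizes as equal as possible and labels shuffled independently per rep.
fold.ids <- replicate(nFold.reps,
                      sample(rep(seq_len(nFold), length.out = n)))

# Fold sizes in the first replication.
table(fold.ids[, 1])

# Within a replication, fold k is held out in turn while the other folds train the model.
k <- 1
valid <- which(fold.ids[, 1] == k)
train <- which(fold.ids[, 1] != k)
```
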

return.estimates

Optional logical. If TRUE, additional items including the marker effect and beta estimates from the selected prediction model (i.e. highest CV accuracy) will be returned. Default is FALSE.

CV.burnIn

Optional integer argument used by BGLR when fitting Bayesian models. Default is 750.

CV.nIter

Optional integer argument used by BGLR (de los Campos and Rodriguez, 2014) when fitting Bayesian models. Default is 1500.

models

Optional character vector of the regression models to be used in CV and to estimate marker effects. Options include rrBLUP, BayesA, BayesB, BayesC, BL, and BRR; one or more may be included at a time. By default all models are tested.

saveAt

When using models other than "rrBLUP" (i.e. Bayesian models), this is a path and prefix for saving temporary files that are produced by the BGLR function. Default is tempdir().
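One way to build a path-plus-prefix value is sketched below. The directory name xval_run and the prefix cv_ are arbitrary choices for illustration; the commented call uses only arguments listed under Usage.

```r
# Create a dedicated, hypothetical run directory inside the session's temp folder.
run.dir <- file.path(tempdir(), "xval_run")
dir.create(run.dir, showWarnings = FALSE, recursive = TRUE)

# Path and prefix in one string: files written with this prefix land in run.dir
# and start with "cv_".
save.prefix <- file.path(run.dir, "cv_")

# save.prefix could then be supplied as saveAt, for example:
# x.val(G.in = G.in_ex, y.in = y.in_ex, models = "BRR", saveAt = save.prefix)
```

Keeping the temporary files in their own directory makes them easy to inspect or delete after the run.
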

Details

Two CV methods are available within PopVar:

Value

A list containing:

Examples


## CV using method 1 with 25 iterations
CV.mthd1 <- x.val(G.in = G.in_ex, y.in = y.in_ex, nCV.iter = 25)
CV.mthd1$CVs

## CV using method 2 with 5 folds and 3 replications
x.val(G.in = G.in_ex, y.in = y.in_ex, nFold = 5, nFold.reps = 3)


[Package PopVar version 1.3.1 Index]