bwgs.cv {BWGS} | R Documentation |
Genomic Prediction with cross validation
Description
The bwgs.cv function carries out cross-validation using genotypic and phenotypic data from a reference population, with options for genotypic matrix processing and genomic breeding value estimation.
Usage
bwgs.cv(
  geno,
  pheno,
  FIXED = "NULL",
  MAXNA = 0.2,
  MAF = 0.05,
  pop.reduct.method = "NULL",
  sample.pop.size = "NULL",
  geno.reduct.method = "NULL",
  reduct.marker.size = "NULL",
  pval = "NULL",
  r2 = "NULL",
  MAP = "NULL",
  geno.impute.method = "NULL",
  predict.method = "NULL",
  nFolds,
  nTimes
)
Arguments
geno |
Matrix (n x m) of genotypes for the training population: n lines with m markers. Genotypes should be coded -1, 0, 1. Missing data are allowed and coded as NA. |
pheno |
Vector (n x 1) of "phenotypes", i.e. observations or pre-processed, corrected values. This vector should have no missing values; otherwise, lines with missing values (NA) are removed from both pheno and geno. As a first step, bwgs.cv checks whether rownames(geno) match names(pheno). If they do not, only the common elements (intersect) are kept in both geno and pheno for further analyses. If a MAP file is provided, the selected set of markers is also extracted from MAP. |
FIXED |
A matrix of fixed effects, to be used with some methods such as those included in BGLR. It MUST have the same rownames as geno and be coded (-1, 0, 1). |
MAXNA |
The maximum proportion of missing values allowed per marker; marker columns in geno exceeding it are filtered out. Default value is 0.2. |
MAF |
The minimum allele frequency for filtering marker columns in geno; markers below this threshold are removed. Default value is 0.05. |
pop.reduct.method |
Method for reducing the size of the training population. This is mainly useful for teaching purposes and has little practical value if the entire population is already genotyped and phenotyped. Default value is NULL (the whole training set is used). Proposed methods are:
|
sample.pop.size |
The size of the subset of individuals in the training set (both geno and pheno) selected by pop.reduct.method if not NULL. |
geno.reduct.method |
Allows sampling a subset of markers to speed up computation and/or to avoid introducing more noise than informative markers. Options are:
|
reduct.marker.size |
Specifies the number of markers retained for the genotypic reduction using RMR (reduct.marker.size < m). |
pval |
p value for ANO method, 0 < pval < 1. |
r2 |
Coefficient of linkage disequilibrium (LD). Set 0 < r2 < 1 if the genotypic reduction method is LD or ANO+LD. |
MAP |
A matrix with markers in rows and at least ONE column named "chrom" (colnames = "chrom"). Used for computing r2 within linkage groups. |
geno.impute.method |
Allows imputation of missing marker data using the two methods proposed in function A.mat of package rrBLUP, namely:
Default value is NULL. Note that these imputation methods are only suited to cases with few missing values, typically marker data from SNP chips or KASPar. They are NOT suited for imputing marker data from low-density to high-density designs, or when there are MANY missing data, as is typical with GBS. More sophisticated software (e.g. Beagle, Browning & Browning 2016) should be used before BWGS. |
predict.method |
The options for genomic breeding value prediction methods. The available options are:
Several Bayesian methods, using the BGLR library:
A more detailed description of these methods can be found in Perez & de los Campos 2014 (http://genomics.cimmyt.org/BGLR-extdoc.pdf).
Three semi-parametric methods:
|
nFolds |
Number of folds for the cross-validation. Smallest value recommended is nFolds = 3. |
nTimes |
Number of independent replicates for the cross-validation. Smallest value recommended is nTimes = 3. |
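The pre-processing described for geno, pheno, MAXNA, MAF and geno.impute.method can be illustrated with a short base-R sketch. This is not the BWGS source code: the toy data are hypothetical, the order of the steps and the way allele frequency is derived from the -1/0/1 coding are assumptions, and the last step uses simple per-marker mean imputation (the idea behind impute.method = "mean" in rrBLUP's A.mat).

```r
# Illustrative sketch only; step order and formulas are assumptions.

# Toy data: 8 lines x 5 markers, coded -1/0/1 with some NA.
geno <- matrix(
  c( 1,  0, -1,  1, -1,
     1,  1, -1, NA, -1,
    -1,  0, -1, NA,  1,
     1, NA, -1, NA,  0,
     0,  1, -1,  1,  1,
    -1, -1, -1,  1,  0,
     1,  1, -1, NA,  1,
     0,  0,  1,  1, -1),
  nrow = 8, byrow = TRUE,
  dimnames = list(paste0("line", 1:8), paste0("snp", 1:5))
)
pheno <- c(line1 = 5.1, line2 = 4.8, line3 = NA, line4 = 5.6,
           line5 = 5.0, line6 = 4.4, line7 = 6.2, line9 = 5.9)

# 1. Drop missing phenotypes, then keep only lines present in both objects.
pheno  <- pheno[!is.na(pheno)]
common <- intersect(rownames(geno), names(pheno))
geno   <- geno[common, , drop = FALSE]
pheno  <- pheno[common]

# 2. Remove markers whose proportion of missing values exceeds MAXNA.
MAXNA   <- 0.2
na_rate <- colMeans(is.na(geno))
geno    <- geno[, na_rate <= MAXNA, drop = FALSE]

# 3. Remove markers whose minor allele frequency is below MAF.
#    With -1/0/1 coding, the frequency of the "+1" allele is (mean + 1) / 2.
MAF  <- 0.05
p    <- (colMeans(geno, na.rm = TRUE) + 1) / 2
maf  <- pmin(p, 1 - p)
geno <- geno[, maf >= MAF, drop = FALSE]

# 4. Replace remaining NA by the marker mean (mean imputation).
for (j in seq_len(ncol(geno))) {
  miss <- is.na(geno[, j])
  geno[miss, j] <- mean(geno[, j], na.rm = TRUE)
}

dim(geno)   # lines x markers kept after filtering
```

In this toy case, snp4 is removed for its missing rate and snp3 because it is monomorphic among the retained lines, which shows why the filters are applied before prediction.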
Value
The function bwgs.cv returns a list containing:
summary: Summary of cross-validation, including the mean and standard deviation of predictive ability (i.e. the correlation between phenotype and GEBV, estimated on the validation fold, then averaged over the nTimes replicates), the time taken by the computation, and the number of markers
cv: Vector of predictive abilities averaged over nFolds, for each of the nTimes replicates
sd: Standard deviation of the nTimes predictive abilities
MSEP: Square root of the mean squared error of prediction, averaged over nTimes
SDMSEP: Standard deviation of the square root of the mean squared error of prediction, averaged over nTimes
bv_table: Matrix of dimension n x 4. Columns are:
Real BV, i.e. pheno vector
Predict BV: the n x 1 vector of GEBVs
gpredSD: Standard deviation of estimated GEBV
CD: coefficient of determination for each GEBV, estimated as sqrt
Note that gpredSD and CD are only available for methods using the BGLR library, namely GBLUP, EGBLUP, BA, BB, BC, BL, RKHS and MKRKHS. These two columns contain NA for methods RF, RR, LASSO, EN and SVM.
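To make the reported quantities concrete, the following base-R sketch computes a predictive ability (correlation between phenotype and GEBV) and a root mean squared error of prediction for each replicate, then their mean and standard deviation across replicates. The phenotypes and GEBVs are hypothetical numbers, the object names are illustrative, and the exact way BWGS aggregates folds and replicates may differ.

```r
# Illustrative sketch only; values and object names are hypothetical.

# Observed phenotypes and GEBVs obtained in two cross-validation replicates.
y <- c(5.1, 4.8, 5.6, 5.0, 4.4, 6.2)
gebv <- list(
  rep1 = c(5.0, 4.9, 5.4, 5.1, 4.6, 5.9),
  rep2 = c(5.2, 4.7, 5.5, 4.8, 4.7, 6.0)
)

# Predictive ability: correlation between phenotype and GEBV, per replicate.
ability <- sapply(gebv, function(g) cor(y, g))

# Square root of the mean squared error of prediction, per replicate.
rmsep <- sapply(gebv, function(g) sqrt(mean((y - g)^2)))

# Mean and standard deviation across replicates.
res <- c(mean_ability = mean(ability), sd_ability = sd(ability),
         mean_rmsep   = mean(rmsep),   sd_rmsep   = sd(rmsep))
round(res, 3)
```

With a single replicate (nTimes = 1), the standard deviations across replicates are undefined (sd() returns NA for one value), which is one reason to use several replicates.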
Examples
library(BWGS)
data(inra)
# Cross validation using GBLUP method
cv_gblup <- bwgs.cv(TRAIN47K, YieldBLUE,
                    geno.impute.method = "mni",
                    predict.method = "gblup",
                    nFolds = 10,
                    nTimes = 1)
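For readers who want to see what nFolds and nTimes control, here is a minimal base-R sketch of repeated k-fold cross-validation on simulated data. It does not call BWGS: the data are simulated, ridge_predict is a hypothetical helper using ridge regression with an arbitrary penalty as a stand-in for predict.method, and the fold assignment is an assumption about how such a scheme is typically set up.

```r
# Illustrative sketch only; simulated data and a stand-in predictor.
set.seed(42)
n <- 60   # number of lines
m <- 40   # number of markers
geno    <- matrix(sample(c(-1, 0, 1), n * m, replace = TRUE), n, m)
effects <- rnorm(m, sd = 0.3)
pheno   <- as.vector(geno %*% effects + rnorm(n))

# Hypothetical helper: ridge regression with a fixed, arbitrary penalty.
ridge_predict <- function(X_train, y_train, X_test, lambda = 10) {
  mu   <- mean(y_train)
  beta <- solve(crossprod(X_train) + lambda * diag(ncol(X_train)),
                crossprod(X_train, y_train - mu))
  as.vector(mu + X_test %*% beta)
}

nFolds <- 5
nTimes <- 3
cv <- numeric(nTimes)

for (rep_i in seq_len(nTimes)) {
  # Randomly assign the n lines to nFolds folds of near-equal size.
  fold <- sample(rep(seq_len(nFolds), length.out = n))
  ability <- numeric(nFolds)
  for (k in seq_len(nFolds)) {
    test <- fold == k
    # Train on the other folds, predict the held-out fold.
    gebv <- ridge_predict(geno[!test, , drop = FALSE], pheno[!test],
                          geno[test, , drop = FALSE])
    ability[k] <- cor(pheno[test], gebv)
  }
  # Predictive ability of this replicate: average over the folds.
  cv[rep_i] <- mean(ability)
}

c(mean = mean(cv), sd = sd(cv))
```

Each replicate reshuffles the fold assignment, so the spread of cv across replicates reflects the sensitivity of the estimate to the random partition.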