format_cv {waves} | R Documentation |
Format multiple trials with or without overlapping genotypes into training and test sets according to user-provided cross validation scheme
Description
Standalone function that is also used within train_spectra to divide trials or studies into training and test sets based on overlap in trial environments and genotype entries.
Usage
format_cv(
  trial1,
  trial2,
  trial3 = NULL,
  cv.scheme,
  stratified.sampling = TRUE,
  proportion.train = 0.7,
  seed = NULL,
  remove.genotype = FALSE
)
Arguments
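trial1 | Required data.frame; must include a "genotype" column (see Details). |
trial2 | Required data.frame; must include a "genotype" column (see Details). |
trial3 | Optional data.frame in the same format. Default is NULL. |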
cv.scheme | A cross validation (CV) scheme from Jarquín et al., 2017. Options for cv.scheme include "CV1", "CV2", "CV0", and "CV00". |
stratified.sampling | If TRUE, training and test sets are selected using stratified sampling. Default is TRUE. |
proportion.train | Fraction of samples to include in the training set. Default is 0.7. |
seed | Number used in the function set.seed() for reproducible randomization. Default is NULL. |
remove.genotype | boolean that, if TRUE, removes the "genotype" column from the output data.frames. Default is FALSE. |
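To illustrate how proportion.train and seed typically interact, here is a minimal base R sketch of a seeded random split. It is not the code used inside format_cv: the helper split_rows and the toy data.frame are hypothetical, and the sketch ignores cv.scheme and stratified sampling entirely.

```r
# Hypothetical helper (not part of waves): plain random split of rows.
split_rows <- function(df, proportion.train = 0.7, seed = NULL) {
  # Only fix the random number generator when a seed is supplied
  if (!is.null(seed)) set.seed(seed)
  n.train <- round(proportion.train * nrow(df))
  train.idx <- sample(seq_len(nrow(df)), size = n.train)
  test.idx <- setdiff(seq_len(nrow(df)), train.idx)
  list(
    train.set = df[train.idx, , drop = FALSE],
    test.set = df[test.idx, , drop = FALSE]
  )
}

# Toy data standing in for a trial (values are made up for illustration)
toy <- data.frame(
  sample.id = 1:10,
  genotype = paste0("G", 1:10),
  reference = seq(20, 38, by = 2)
)

split.a <- split_rows(toy, proportion.train = 0.7, seed = 42)
split.b <- split_rows(toy, proportion.train = 0.7, seed = 42)

nrow(split.a$train.set) # 7 rows in the training set
nrow(split.a$test.set)  # 3 rows in the test set
identical(split.a, split.b) # same seed, same split
```

Passing the same seed twice reproduces the split exactly, which is the usual reason to set it when comparing models.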
Details
Use of a cross-validation scheme requires a column named "genotype" in the input data.frame to ensure proper sorting of training and test sets. Variables trial1 and trial2 are required, while trial3 is optional.
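Before calling format_cv, it can help to confirm that each trial has a "genotype" column and to see how much the genotype entries overlap. The base R sketch below does this on two made-up trials; the helper has_genotype and the data.frames trial.a and trial.b are hypothetical and are not part of waves.

```r
# Two toy trials (hypothetical values) sharing some genotypes
trial.a <- data.frame(
  sample.id = 1:4,
  genotype = c("G1", "G2", "G3", "G4"),
  reference = c(30.1, 28.4, 33.0, 29.7)
)
trial.b <- data.frame(
  sample.id = 5:8,
  genotype = c("G3", "G4", "G5", "G6"),
  reference = c(31.2, 27.9, 34.5, 30.0)
)

# Hypothetical helper: does the data.frame carry the required column?
has_genotype <- function(df) "genotype" %in% colnames(df)
stopifnot(has_genotype(trial.a), has_genotype(trial.b))

# Genotypes present in both trials, and those unique to the first trial
shared <- intersect(trial.a$genotype, trial.b$genotype)
only.a <- setdiff(trial.a$genotype, trial.b$genotype)

shared # "G3" "G4"
only.a # "G1" "G2"
```

An empty shared vector would suggest that schemes relying on genotype overlap between trials are not a sensible choice for that pair of trials.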
Value
List of data.frames ($train.set, $test.set) compiled according to user-provided cross validation scheme.
Author(s)
Jenna Hershberger jmh579@cornell.edu
References
Jarquín, D., C. Lemes da Silva, R. C. Gaynor, J. Poland, A. Fritz, R. Howard, S. Battenfield, and J. Crossa. 2017. Increasing genomic-enabled prediction accuracy by modeling genotype × environment interactions in Kansas wheat. Plant Genome 10(2):1-15. <doi:10.3835/plantgenome2016.12.0130>
Examples
# Must have a column called "genotype", so we'll create a fake one for now
# We will use CV00, which does not require any overlap in genotypes
# In real scenarios, CV schemes that rely on genotypes should not be applied
# when genotypes are unknown, as in this case.
library(waves) # provides format_cv() and the ikeogu.2017 example data
library(magrittr)

# Keep identifiers, the reference trait, and the spectral columns
trials <- ikeogu.2017 %>%
  dplyr::mutate(genotype = seq_len(nrow(ikeogu.2017))) %>% # fake for this example
  dplyr::rename(reference = DMC.oven) %>%
  dplyr::select(
    study.name, sample.id, genotype, reference,
    tidyselect::starts_with("X")
  )

# Split the combined data into one data.frame per study
trial1 <- trials %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::select(-study.name)
trial2 <- trials %>%
  dplyr::filter(study.name == "C16Mval") %>%
  dplyr::select(-study.name)

cv.list <- format_cv(
  trial1 = trial1, trial2 = trial2, cv.scheme = "CV00",
  stratified.sampling = FALSE, remove.genotype = TRUE
)

# Preview the first rows and columns of each set
cv.list$train.set[1:5, 1:5]
cv.list$test.set[1:5, 1:5]