test_spectra {waves}    R Documentation

Test the performance of spectral models

Description

Wrapper that trains models based on spectral data to predict reference values and reports model performance statistics.

Usage

test_spectra(
  train.data,
  num.iterations,
  test.data = NULL,
  pretreatment = 1,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  seed = 1,
  verbose = TRUE,
  wavelengths = deprecated(),
  preprocessing = deprecated(),
  output.summary = deprecated(),
  rf.variable.importance = deprecated()
)

Arguments

train.data

data.frame object of spectral data for input into a spectral prediction model. The first column contains unique identifiers and the second contains reference values, followed by spectral columns. Include no other columns to the right of the spectra! Column names of spectra must start with "X" and the reference column must be named "reference".
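The column layout described above can be sketched with a small toy data set. The sample names, reference values, and wavelengths (350 to 352) below are hypothetical and serve only to illustrate the expected structure.

```r
# Hypothetical toy data: 4 samples, 3 spectral columns
set.seed(1)
spectra <- matrix(runif(12), nrow = 4,
                  dimnames = list(NULL, paste0("X", 350:352)))

toy.train <- data.frame(
  unique.id = paste0("sample", 1:4),      # column 1: unique identifiers
  reference = c(30.1, 28.4, 35.0, 31.7),  # column 2: reference values
  spectra                                 # remaining columns: spectra ("X" prefix)
)

# Structural checks mirroring the requirements described above
stopifnot(
  names(toy.train)[2] == "reference",
  all(startsWith(names(toy.train)[-(1:2)], "X")),
  !anyDuplicated(toy.train$unique.id)
)
```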

num.iterations

Number of training iterations to perform

test.data

data.frame with the same specifications as train.data. Use if a specific test set is desired for hyperparameter tuning. If NULL, the function will automatically train with a stratified sample of 70%. Default is NULL.

pretreatment

Number or list of numbers 1:13 corresponding to desired pretreatment method(s):

  1. Raw data (default)

  2. Standard normal variate (SNV)

  3. SNV and first derivative

  4. SNV and second derivative

  5. First derivative

  6. Second derivative

  7. Savitzky–Golay filter (SG)

  8. SNV and SG

  9. Gap-segment derivative (window size = 11)

  10. SG and first derivative (window size = 5)

  11. SG and first derivative (window size = 11)

  12. SG and second derivative (window size = 5)

  13. SG and second derivative (window size = 11)
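To give a sense of what two of these pretreatments do, here is a minimal base R sketch of a standard normal variate transform and a simple first difference. The helper names snv and first_derivative are hypothetical, and the package's own implementation may differ (for instance, in how derivatives are computed or smoothed).

```r
# Standard normal variate: center and scale each spectrum (row) individually.
snv <- function(spectra) {
  t(scale(t(spectra)))  # scale() works column-wise, so transpose first
}

# Simple first derivative: difference between adjacent wavelengths.
first_derivative <- function(spectra) {
  t(apply(spectra, 1, diff))
}

# Two hypothetical spectra that differ only by a multiplicative factor
toy <- rbind(c(1, 2, 3, 4),
             c(2, 4, 6, 8))

toy.snv <- snv(toy)                # both rows become identical after SNV
toy.d1 <- first_derivative(toy)    # rows: 1 1 1 and 2 2 2
```

SNV removes the multiplicative difference between the two toy spectra, which is why it is often used to reduce scatter effects before modeling.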

k.folds

Number indicating the number of folds for k-fold cross-validation during model training. Default is 5.

proportion.train

Fraction of samples to include in the training set. Default is 0.7.

tune.length

Number delineating the search space for tuning of the PLSR hyperparameter ncomp. Must be set to 5 when using the random forest algorithm (model.method = "rf"). Default is 50.

model.method

Model type to use for training. Valid options include:

  • "pls": Partial least squares regression (Default)

  • "rf": Random forest

  • "svmLinear": Support vector machine with linear kernel

  • "svmRadial": Support vector machine with radial kernel

best.model.metric

Metric used to decide which model is best. Must be either "RMSE" or "Rsquared". Default is "RMSE".
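For orientation, the two metrics can be sketched in base R as follows. The helper names rmse and rsquared are hypothetical, and R-squared is shown here as the squared Pearson correlation, which is one common definition; the package may compute it differently. Lower RMSE and higher R-squared indicate a better model.

```r
# Root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# R-squared as the squared Pearson correlation (an assumption, see above)
rsquared <- function(observed, predicted) {
  cor(observed, predicted)^2
}

# Hypothetical reference values and predictions
observed <- c(30, 32, 35, 31)
predicted <- c(29, 33, 34, 32)

rmse(observed, predicted)      # 1, since every error is +1 or -1
rsquared(observed, predicted)  # between 0 and 1
```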

stratified.sampling

If TRUE, training and test sets will be selected using stratified random sampling. This argument is only used if test.data is NULL. Default is TRUE.
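As a rough illustration of stratified random sampling on a continuous reference value, the sketch below bins samples by quantiles and draws the training share within each bin. The function stratified_split and its n.bins parameter are hypothetical; this is not necessarily how the package stratifies internally.

```r
stratified_split <- function(reference, proportion.train = 0.7,
                             n.bins = 4, seed = 1) {
  set.seed(seed)
  # Bin samples by quantiles of the reference value
  breaks <- quantile(reference, probs = seq(0, 1, length.out = n.bins + 1))
  bins <- cut(reference, breaks = unique(breaks), include.lowest = TRUE)
  # Within each bin, draw the requested share of row indices for training
  train.idx <- unlist(lapply(split(seq_along(reference), bins), function(idx) {
    idx[sample.int(length(idx), size = round(length(idx) * proportion.train))]
  }))
  sort(unname(train.idx))
}

reference <- seq(20, 39.5, by = 0.5)      # 40 hypothetical reference values
train.idx <- stratified_split(reference)  # row indices for the training set
test.idx <- setdiff(seq_along(reference), train.idx)
length(train.idx)                         # 28 of 40 samples (7 per bin)
```

Sampling within bins keeps the range of reference values represented in both the training and test sets, which a purely random split does not guarantee for small data sets.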

cv.scheme

A cross validation (CV) scheme from Jarquín et al., 2017. Options for cv.scheme include:

  • "CV1": untested lines in tested environments

  • "CV2": tested lines in tested environments

  • "CV0": tested lines in untested environments

  • "CV00": untested lines in untested environments

trial1

data.frame object that is for use only when cv.scheme is provided. Contains the trial to be tested in subsequent model training functions. The first column contains unique identifiers, the second contains genotypes, and the third contains reference values, followed by spectral columns. Include no other columns to the right of the spectra! Column names of spectra must start with "X", the reference column must be named "reference", and the genotype column must be named "genotype".

trial2

data.frame object that is for use only when cv.scheme is provided. This data.frame contains a trial that has overlapping genotypes with trial1 but that were grown in a different site/year (different environment). Formatting must be consistent with trial1.

trial3

data.frame object that is for use only when cv.scheme is provided. This data.frame contains a trial that may or may not contain genotypes that overlap with trial1. Formatting must be consistent with trial1.
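The trial format and the genotype overlap between trial1 and trial2 can be checked before calling the function. The sketch below uses hypothetical genotype names, reference values, and wavelengths purely for illustration.

```r
# Hypothetical trials following the column layout described for trial1
trial1 <- data.frame(
  unique.id = c("t1_a", "t1_b", "t1_c"),
  genotype = c("G1", "G2", "G3"),
  reference = c(30.2, 28.9, 33.1),
  X350 = c(0.41, 0.39, 0.44),
  X351 = c(0.42, 0.40, 0.45)
)
trial2 <- data.frame(
  unique.id = c("t2_a", "t2_b", "t2_c"),
  genotype = c("G2", "G3", "G4"),
  reference = c(29.5, 32.0, 27.8),
  X350 = c(0.38, 0.43, 0.36),
  X351 = c(0.39, 0.44, 0.37)
)

# Both trials must share the same column layout
stopifnot(identical(names(trial1), names(trial2)))

# Genotypes grown in both environments
shared.genotypes <- intersect(trial1$genotype, trial2$genotype)
shared.genotypes  # "G2" "G3"
```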

split.test

Boolean that allows for a fixed training set and a split test set. For example, train a model on data from two breeding programs plus a stratified subset (70%) of a third, and test on the remaining samples (30%) of the third. If FALSE, the entire provided test set (test.data) will remain as a testing set or, if none is provided, 30% of the provided train.data will be used for testing. Default is FALSE.

seed

Integer to be used internally as input for set.seed(). Only used if stratified.sampling = TRUE. In all other cases, seed is set to the current iteration number. Default is 1.

verbose

If TRUE, the number of rows removed through filtering will be printed to the console. Default is TRUE.

wavelengths

DEPRECATED wavelengths is no longer supported; this information is now inferred from train.data column names.

preprocessing

DEPRECATED please use pretreatment to specify the specific pretreatment(s) to test. For behavior identical to that of preprocessing = TRUE, set pretreatment = 1:13.

output.summary

DEPRECATED output.summary = FALSE is no longer supported; a summary of output is always returned alongside the full performance statistics.

rf.variable.importance

DEPRECATED rf.variable.importance = FALSE is no longer supported; variable importance results are always returned if the model.method is set to 'pls' or 'rf'.

Details

Calls pretreat_spectra, format_cv, and train_spectra functions.

Value

list of 5 objects:

  1. 'model.list' is a list of trained model objects, one for each pretreatment method specified by the pretreatment argument. Each model is trained with all rows of train.data.

  2. 'summary.model.performance' is a data.frame containing summary statistics across all model training iterations and pretreatments. See below for a description of the summary statistics provided.

  3. 'model.performance' is a data.frame containing performance statistics for each iteration of model training separately (see below).

  4. 'predictions' is a data.frame containing both reference and predicted values for each test set entry in each iteration of model training.

  5. 'importance' is a data.frame containing variable importance results for each wavelength at each iteration of model training. If model.method is not "pls" or "rf", this list item is NULL.

'summary.model.performance' and 'model.performance' data.frames summary statistics include:

Author(s)

Jenna Hershberger jmh579@cornell.edu

Examples


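library(magrittr)
# Rename columns to the names test_spectra() expects
ikeogu.2017 %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  # Keep only identifiers, reference values, and spectral columns
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  # Train PLS models (the default) on raw spectra for 3 iterations
  test_spectra(
    train.data = .,
    tune.length = 3,
    num.iterations = 3,
    pretreatment = 1
  )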


[Package waves version 0.2.5 Index]