| train_spectra {waves} | R Documentation | 
Train a model to predict reference values from spectral data
Description
Trains spectral prediction models using one of several algorithms and sampling procedures.
Usage
train_spectra(
  df,
  num.iterations,
  test.data = NULL,
  k.folds = 5,
  proportion.train = 0.7,
  tune.length = 50,
  model.method = "pls",
  best.model.metric = "RMSE",
  stratified.sampling = TRUE,
  cv.scheme = NULL,
  trial1 = NULL,
  trial2 = NULL,
  trial3 = NULL,
  split.test = FALSE,
  seed = 1,
  verbose = TRUE,
  save.model = deprecated(),
  rf.variable.importance = deprecated(),
  output.summary = deprecated(),
  return.model = deprecated()
)
Arguments
df | 
 
  | 
num.iterations | 
 Number of training iterations to perform  | 
test.data | 
 
  | 
k.folds | 
 Number indicating the number of folds for k-fold cross-validation during model training. Default is 5.  | 
proportion.train | 
 Fraction of samples to include in the training set. Default is 0.7.  | 
tune.length | 
 Number delineating search space for tuning of the PLSR
hyperparameter   | 
model.method | 
 Model type to use for training. Valid options include: 
  | 
best.model.metric | 
 Metric used to decide which model is best. Must be either "RMSE" or "Rsquared"  | 
stratified.sampling | 
 If   | 
cv.scheme | 
 A cross validation (CV) scheme from Jarquín et al., 2017.
Options for  
  | 
trial1 | 
 
  | 
trial2 | 
 
  | 
trial3 | 
 
  | 
split.test | 
 Boolean that allows for a fixed training set and a split
test set. Example: train a model on data from two breeding programs and a
stratified subset (70%) of a third, then test on the remaining samples
(30%) of the third. If   | 
seed | 
 Integer to be used internally as input for   | 
verbose | 
 If   | 
save.model | 
 DEPRECATED   | 
rf.variable.importance | 
 DEPRECATED
  | 
output.summary | 
 DEPRECATED   | 
return.model | 
 DEPRECATED   | 
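The interplay of num.iterations, proportion.train, and seed can be sketched in base R as below. This is a minimal illustration using simple random sampling on a hypothetical toy data frame (the object names toy and splits are invented here); it is not the package's internal partitioning code, which may differ, particularly when stratified.sampling is enabled.

```r
# Hypothetical toy data: 20 samples, each with an identifier and a reference value.
set.seed(1)  # mirrors the default `seed = 1`
toy <- data.frame(
  unique.id = paste0("s", 1:20),
  reference = rnorm(20, mean = 35, sd = 4)
)

num.iterations   <- 3
proportion.train <- 0.7

# Draw one random train/test partition per iteration (assumption: plain
# random sampling, no stratification).
splits <- lapply(seq_len(num.iterations), function(i) {
  n.train   <- round(proportion.train * nrow(toy))
  train.idx <- sample(nrow(toy), size = n.train)
  list(train = toy[train.idx, ], test = toy[-train.idx, ])
})

# Number of training and test rows in each iteration.
sapply(splits, function(s) c(train = nrow(s$train), test = nrow(s$test)))
```

Each iteration yields its own test set, which is why performance statistics are reported per iteration and then summarized.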
Value
list of the following:
-  model is a model object trained with all rows of df.
-  summary.model.performance is a data.frame with model performance statistics in summary format (2 rows, one with mean and one with standard deviation of all training iterations).
-  full.model.performance is a data.frame with model performance statistics in long format (number of rows = num.iterations).
-  predictions is a data.frame containing predicted values for each test set entry at each iteration of model training.
-  importance is a data.frame that contains variable importance for each wavelength. Only available for model.method options "rf" and "pls".
Included summary statistics:
-  Tuned parameters depending on the model algorithm:
   -  Best.n.comp, the best number of components
   -  Best.ntree, the best number of trees in an RF model
   -  Best.mtry, the best number of variables to include at every decision point in an RF model
-  RMSECV, the root mean squared error of cross-validation
-  R2cv, the coefficient of multiple determination of cross-validation for PLSR models
-  RMSEP, the root mean squared error of prediction
-  R2p, the squared Pearson’s correlation between predicted and observed test set values
-  RPD, the ratio of standard deviation of observed test set values to RMSEP
-  RPIQ, the ratio of performance to interquartile difference
-  CCC, the concordance correlation coefficient
-  Bias, the average difference between the predicted and observed values
-  SEP, the standard error of prediction
-  R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
 
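The test set statistics listed above can be computed from a pair of observed and predicted vectors. The sketch below uses base R and hypothetical values. RMSEP, R2p, RPD, Bias, and R2sp follow the definitions given in this section. The formulas for RPIQ (interquartile range over RMSEP), SEP (bias-corrected, n - 1 denominator), and CCC (Lin's coefficient with 1/n moments) are common conventions assumed here, and may not match the package's implementation exactly.

```r
# Hypothetical observed and predicted test set values.
observed  <- c(30.1, 32.4, 35.0, 36.2, 38.9, 41.3)
predicted <- c(31.0, 31.8, 34.1, 37.0, 38.0, 42.5)

residual <- predicted - observed
n        <- length(observed)

rmsep <- sqrt(mean(residual^2))                         # root mean squared error of prediction
bias  <- mean(residual)                                 # mean of (predicted - observed)
sep   <- sqrt(sum((residual - bias)^2) / (n - 1))       # assumed: bias-corrected standard error
r2p   <- cor(predicted, observed)^2                     # squared Pearson correlation
r2sp  <- cor(predicted, observed, method = "spearman")^2  # squared Spearman correlation
rpd   <- sd(observed) / rmsep                           # SD of observed values over RMSEP
rpiq  <- IQR(observed) / rmsep                          # assumed: interquartile range over RMSEP

# Assumed: Lin's concordance correlation coefficient with population (1/n) moments.
pop.var <- function(x) mean((x - mean(x))^2)
pop.cov <- mean((observed - mean(observed)) * (predicted - mean(predicted)))
ccc <- 2 * pop.cov /
  (pop.var(observed) + pop.var(predicted) + (mean(observed) - mean(predicted))^2)

round(c(RMSEP = rmsep, R2p = r2p, RPD = rpd, RPIQ = rpiq, CCC = ccc,
        Bias = bias, SEP = sep, R2sp = r2sp), 3)
```

Under these conventions RMSEP, Bias, and SEP are linked by RMSEP^2 = Bias^2 + SEP^2 * (n - 1) / n, which gives a quick consistency check on reported values.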
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
library(waves)     # provides train_spectra() and the ikeogu.2017 dataset
library(magrittr)  # provides the %>% pipe
ikeogu.2017 %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  train_spectra(
    df = .,
    tune.length = 3,
    num.iterations = 3,
    best.model.metric = "RMSE",
    stratified.sampling = TRUE
  ) %>%
  summary()