train_spectra {waves} | R Documentation |
Train a model to predict reference values with spectral data
Description
Trains spectral prediction models using one of several algorithms and sampling procedures.
Usage
train_spectra(
df,
num.iterations,
test.data = NULL,
k.folds = 5,
proportion.train = 0.7,
tune.length = 50,
model.method = "pls",
best.model.metric = "RMSE",
stratified.sampling = TRUE,
cv.scheme = NULL,
trial1 = NULL,
trial2 = NULL,
trial3 = NULL,
split.test = FALSE,
seed = 1,
verbose = TRUE,
save.model = deprecated(),
rf.variable.importance = deprecated(),
output.summary = deprecated(),
return.model = deprecated()
)
Arguments
df |
|
num.iterations |
Number of training iterations to perform |
test.data |
|
k.folds |
Number indicating the number of folds for k-fold cross-validation during model training. Default is 5. |
proportion.train |
Fraction of samples to include in the training set. Default is 0.7. |
tune.length |
Number delineating the search space for tuning of the PLSR hyperparameter. Default is 50. |
model.method |
Model type to use for training. Valid options include:
|
best.model.metric |
Metric used to decide which model is best. Must be either "RMSE" or "Rsquared" |
stratified.sampling |
If |
cv.scheme |
A cross-validation (CV) scheme from Jarquín et al., 2017.
Options for
|
trial1 |
|
trial2 |
|
trial3 |
|
split.test |
Boolean that allows for a fixed training set and a split test set.
Example: train a model on data from two breeding programs and a
stratified subset (70%) of a third, then test on the remaining samples
(30%) of the third. If |
seed |
Integer to be used internally as input for |
verbose |
If |
save.model |
DEPRECATED |
rf.variable.importance |
DEPRECATED |
output.summary |
DEPRECATED |
return.model |
DEPRECATED |
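The interplay of seed, proportion.train, k.folds and stratified.sampling can be pictured with a small base R sketch. This is an illustration only, not the package's internal code: the simulated reference values, the quartile binning, and every variable name below are assumptions made for the example.

```r
# Illustrative sketch (assumed approach, not the waves implementation):
# a stratified 70/30 split followed by k-fold assignment of the training rows.
set.seed(1)                                  # plays the role of `seed`
n <- 100
reference <- rnorm(n, mean = 35, sd = 5)     # simulated reference values
proportion_train <- 0.7                      # plays the role of `proportion.train`
k_folds <- 5                                 # plays the role of `k.folds`

# Bin the reference values into quartiles so both sets span the full range
bins <- cut(reference,
            breaks = quantile(reference, probs = seq(0, 1, 0.25)),
            include.lowest = TRUE)

# Draw proportion_train of the rows within each bin
train_idx <- unlist(lapply(split(seq_len(n), bins), function(rows) {
  rows[sample.int(length(rows), size = round(proportion_train * length(rows)))]
}), use.names = FALSE)
test_idx <- setdiff(seq_len(n), train_idx)

# Randomly assign each training row to one of k_folds cross-validation folds
fold_id <- sample(rep_len(seq_len(k_folds), length(train_idx)))
```

With a fixed seed the same rows land in the training set on every run, which is what makes repeated iterations comparable.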
Value
List of the following:
- model is a model object trained with all rows of df.
- summary.model.performance is a data.frame with model performance statistics in summary format (2 rows, one with mean and one with standard deviation of all training iterations).
- full.model.performance is a data.frame with model performance statistics in long format (number of rows = num.iterations).
- predictions is a data.frame containing predicted values for each test set entry at each iteration of model training.
- importance is a data.frame that contains variable importance for each wavelength. Only available for model.method options "rf" and "pls".
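The relationship between the long and summary performance tables can be sketched in base R. The per-iteration numbers and the exact column layout below are made up for illustration; only the idea of one mean row and one standard deviation row comes from the description above.

```r
# Hypothetical long-format results: one row per training iteration.
# Values and column layout are illustrative assumptions.
full_perf <- data.frame(
  Iteration = 1:3,
  RMSEP = c(1.2, 1.0, 1.1),
  R2p = c(0.80, 0.85, 0.82)
)

stat_cols <- c("RMSEP", "R2p")

# Collapse to two rows: the mean and the standard deviation over iterations
summary_perf <- data.frame(
  SummaryType = c("mean", "sd"),
  rbind(sapply(full_perf[stat_cols], mean),
        sapply(full_perf[stat_cols], sd)),
  row.names = NULL
)
print(summary_perf)
```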
Included summary statistics:
- Tuned parameters depending on the model algorithm:
  - Best.n.comp, the best number of components
  - Best.ntree, the best number of trees in an RF model
  - Best.mtry, the best number of variables to include at every decision point in an RF model
- RMSECV, the root mean squared error of cross-validation
- R2cv, the coefficient of multiple determination of cross-validation for PLSR models
- RMSEP, the root mean squared error of prediction
- R2p, the squared Pearson’s correlation between predicted and observed test set values
- RPD, the ratio of standard deviation of observed test set values to RMSEP
- RPIQ, the ratio of performance to interquartile difference
- CCC, the concordance correlation coefficient
- Bias, the average difference between the predicted and observed values
- SEP, the standard error of prediction
- R2sp, the squared Spearman’s rank correlation between predicted and observed test set values
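The test set statistics listed above can be reproduced from a pair of observed and predicted vectors. The following base R sketch uses the common textbook definitions and may differ in detail from what train_spectra() computes; the helper name prediction_stats and the toy values are hypothetical. In particular, the bias-corrected form of SEP and the use of 1/n moments in Lin's CCC are assumptions.

```r
# Hypothetical helper: textbook versions of the test set statistics.
prediction_stats <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted), length(observed) > 1)
  n <- length(observed)
  residual <- predicted - observed
  bias <- mean(residual)                     # average predicted minus observed
  rmsep <- sqrt(mean(residual^2))
  # Assumed definition: standard deviation of the bias-corrected residuals
  sep <- sqrt(sum((residual - bias)^2) / (n - 1))
  # Lin's concordance correlation coefficient, using 1/n moments (assumption)
  cov_n <- mean((observed - mean(observed)) * (predicted - mean(predicted)))
  var_obs_n <- mean((observed - mean(observed))^2)
  var_pred_n <- mean((predicted - mean(predicted))^2)
  ccc <- 2 * cov_n /
    (var_obs_n + var_pred_n + (mean(observed) - mean(predicted))^2)
  data.frame(
    RMSEP = rmsep,
    R2p = cor(observed, predicted, method = "pearson")^2,
    RPD = sd(observed) / rmsep,              # spread of observations vs. error
    RPIQ = IQR(observed) / rmsep,            # interquartile range vs. error
    CCC = ccc,
    Bias = bias,
    SEP = sep,
    R2sp = cor(observed, predicted, method = "spearman")^2
  )
}

# Toy values, made up for illustration
observed <- c(30.1, 32.4, 35.0, 37.8, 40.2)
predicted <- c(30.9, 31.8, 35.6, 37.1, 41.0)
stats <- prediction_stats(observed, predicted)
print(stats)
```

RPD and RPIQ both scale the prediction error by the spread of the observed values, so larger values indicate a more useful model; RPIQ is less sensitive to skewed reference distributions.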
Author(s)
Jenna Hershberger jmh579@cornell.edu
Examples
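library(magrittr)
library(waves)

# Keep one study, rename the ID and reference columns, keep only the
# spectral columns (prefixed with "X"), and drop incomplete rows
ikeogu.2017 %>%
  dplyr::filter(study.name == "C16Mcal") %>%
  dplyr::rename(reference = DMC.oven,
                unique.id = sample.id) %>%
  dplyr::select(unique.id, reference, dplyr::starts_with("X")) %>%
  na.omit() %>%
  # Small tune.length and num.iterations keep the example fast
  train_spectra(
    df = .,
    tune.length = 3,
    num.iterations = 3,
    best.model.metric = "RMSE",
    stratified.sampling = TRUE
  ) %>%
  summary()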