cvFPM {RFPM} | R Documentation |
Cross-Validation Optimization of the Floating Percentile Model
Description
Use leave-one-out (LOO) or k-folds cross-validation to calculate parameter inputs that optimize benchmark performance while attempting to account for out-of-sample uncertainty.
Usage
cvFPM(
data,
paramList,
FN_crit = seq(0.1, 0.9, by = 0.05),
alpha.test = seq(0.05, 0.5, by = 0.05),
k = 5,
seed = 1,
plot = TRUE,
which = c(1, 2, 3, 4),
colors = heat.colors(10),
colsteps = 100,
...
)
Arguments
data |
data.frame containing, at a minimum, chemical concentrations as columns and a logical |
paramList |
character vector of column names of chemical concentration variables in |
FN_crit |
numeric vector of values between 0 and 1 indicating false negative threshold for floating percentile model benchmark selection (default = seq(0.1, 0.9, by = 0.05))
alpha.test |
numeric vector of values between 0 and 1 indicating type-I error rate for chemical selection (default = seq(0.05, 0.5, by = 0.05))
k |
integer with length = 1 and value > 1 indicating how many folds to include in k-folds type cross-validation method (default = 5)
seed |
integer with length = 1 indicating the random seed to set for assigning k classes for k-folds cross-validation (default = 1)
plot |
whether to plot the output of |
which |
numeric or character indicating which type of plot to generate (see Details; default = c(1, 2, 3, 4))
colors |
values recognizable as colors to be passed to
colsteps |
integer; number of discrete steps to interpolate colors in |
... |
additional arguments passed to |
Details
cvFPM allows users to "tune" the FN_crit and alpha.test arguments used by FPM (via chemSig). This is achieved by splitting the empirical dataset into "test" and "training" subsets, calculating benchmarks for the training set, and then evaluating the benchmarks' ability to predict Hits in the out-of-sample test set. The output of cvFPM is similar to that of optimFPM: optimal FN_crit and alpha.test inputs based on several classification metrics (see ?optimFPM for more details). The key difference between cvFPM and optimFPM is that cvFPM attempts to account for out-of-sample uncertainty, whereas optimFPM is specific (and potentially overly specific) to the full FPM dataset. Because the primary use of FPM SQBs will be to predict toxicity in sediment samples where toxicity is not measured, the FPM should be parameterized in a way that best accounts for out-of-sample uncertainty. In other words, while FPM generates classification metrics like "overall reliability" for SQBs, the SQBs are unlikely to achieve the expected reliability when applied to new samples. This is an inherent limitation of SQBs, which the FPM cannot fully address but which cvFPM considers.
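The gap between in-sample and out-of-sample performance can be illustrated with a small base R sketch. Everything here is an assumption made for illustration: the data are simulated, the benchmark rule is a simple stand-in rather than the FPM algorithm, and the `reliability` helper is hypothetical (it is taken to mean the share of samples classified correctly).

```r
# Simulated data for illustration only: one chemical and a logical Hit column
set.seed(1)
n <- 60
toy <- data.frame(Cu = rlnorm(n, meanlog = 3, sdlog = 1))
toy$Hit <- toy$Cu * rlnorm(n, sdlog = 0.8) > 25

# Hold out one fifth of the rows as the out-of-sample test set
test_rows <- sample(n, size = n / 5)
train <- toy[-test_rows, ]
test  <- toy[test_rows, ]

# Stand-in benchmark (NOT the FPM algorithm): the concentration below which
# a fraction FN_crit of the training Hits fall
FN_crit <- 0.2
benchmark <- unname(quantile(train$Cu[train$Hit], probs = FN_crit))

# Hypothetical helper: share of samples the benchmark classifies correctly
reliability <- function(d, b) mean((d$Cu > b) == d$Hit)

# Compare performance on the data used to set the benchmark vs. held-out data
c(train = reliability(train, benchmark), test = reliability(test, benchmark))
```

Repeating the split with different held-out rows gives a sense of how much the test-set value varies, which is the quantity a cross-validation procedure summarizes.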
Two cross-validation methods are available, controlled through the k argument. If the user specifies k = NULL or k = nrow(data), then leave-one-out (LOO) is used. LOO is computationally intensive but is better suited to small datasets. Other values of k (e.g., the default k = 5) will result in applying a k-folds cross-validation method, which uses larger test subsets (and smaller training sets), evaluates fewer scenarios, and greatly improves runtime for large datasets. The seed argument can be used to establish a consistent result; if is.null(seed), the result will vary based on randomization. Allowing for randomization may be desirable to understand between-run variability in cvFPM output caused by re-sampling of training/test sets.
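The roles of k and seed described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the internals of cvFPM: the helper name `assign_folds` is hypothetical, and drawing folds by shuffling balanced labels is one common approach that may differ from what the package does.

```r
# Hypothetical helper (not part of RFPM): assign each of n rows to a fold
assign_folds <- function(n, k = 5, seed = 1) {
  # k = NULL or k = n corresponds to leave-one-out: every row is its own fold
  if (is.null(k) || k == n) return(seq_len(n))
  stopifnot(length(k) == 1, k > 1, k < n)
  # A fixed seed makes the assignment reproducible; seed = NULL leaves it random
  if (!is.null(seed)) set.seed(seed)
  # Balanced fold labels 1..k, shuffled across the rows
  sample(rep(seq_len(k), length.out = n))
}

folds <- assign_folds(n = 23, k = 5, seed = 1)
table(folds)                            # fold sizes differ by at most one row

loo <- assign_folds(n = 23, k = NULL)
length(unique(loo))                     # one fold per row
```

Calling the helper twice with the same seed returns the same assignment, while `seed = NULL` gives a different split on each call, mirroring the between-run variability mentioned above.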
By setting plot = TRUE (the default), the outcome of cross-validation can be visualized over the range of FN_crit values considered. Visualizing the results can inform the user about variability in the cross-validation process, ranges of potentially reasonable FN_crit values, etc. Graphical output depends on whether many FN_crit and/or many alpha.test values are evaluated, with line plots or heat plots generated accordingly.
IMPORTANT: cvFPM is not in itself optimized for runtime; running cvFPM can take a long time.
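A rough way to anticipate runtime is to count how many training fits a call implies. The sketch below assumes, without confirmation from the package source, that every combination of FN_crit and alpha.test is refit once per fold; `n_fits` is a hypothetical helper, and the row count of 200 is an arbitrary example.

```r
# Default tuning grids, copied from the Usage section
FN_crit    <- seq(0.1, 0.9, by = 0.05)
alpha.test <- seq(0.05, 0.5, by = 0.05)

# Hypothetical helper: number of fits if each grid cell is refit for each fold
n_fits <- function(n_rows, k = 5) {
  folds <- if (is.null(k)) n_rows else k   # k = NULL means leave-one-out
  length(FN_crit) * length(alpha.test) * folds
}

n_fits(n_rows = 200, k = 5)      # k-folds: 17 * 10 * 5 = 850
n_fits(n_rows = 200, k = NULL)   # leave-one-out: 17 * 10 * 200 = 34000
```

Under this assumption, narrowing either grid or choosing a small k reduces the work proportionally.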
The which argument can be used to specify which of the metric-specific plots should be generated when plot = TRUE. Inputs to which are, by default, c(1, 2, 3, 4).
Value
data.frame of metric output, base R graphical output
See Also
chemSig, FPM, optimFPM
Examples
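# Chemical concentration columns to consider
paramList = c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
# Arrange plots in a 2 x 2 grid
par(mfrow = c(2,2))
# Tune FN_crit over 0.1 to 0.9 with alpha.test fixed at 0.05 (default k = 5)
cvFPM(h.tristate, paramList, FN_crit = seq(0.1, 0.9, 0.1), alpha.test = 0.05)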