| cvFPM {RFPM} | R Documentation |
Cross-Validation Optimization of the Floating Percentile Model
Description
Use leave-one-out (LOO) or k-folds cross-validation to calculate parameter inputs that optimize benchmark performance while attempting to account for out-of-sample uncertainty.
Usage
cvFPM(
data,
paramList,
FN_crit = seq(0.1, 0.9, by = 0.05),
alpha.test = seq(0.05, 0.5, by = 0.05),
k = 5,
seed = 1,
plot = TRUE,
which = c(1, 2, 3, 4),
colors = heat.colors(10),
colsteps = 100,
...
)
Arguments
data |
data.frame containing, at a minimum, chemical concentrations as columns and a logical |
paramList |
character vector of column names of chemical concentration variables in data |
FN_crit |
numeric vector of values between 0 and 1 indicating false negative threshold for floating percentile model benchmark selection (default = seq(0.1, 0.9, by = 0.05)) |
alpha.test |
numeric vector of values between 0 and 1 indicating type-I error rate for chemical selection (default = seq(0.05, 0.5, by = 0.05)) |
k |
integer with length = 1 and value > 1 indicating how many folds to include in k-folds type cross-validation method (default = 5) |
seed |
integer with length = 1 indicating the random seed to set for assigning k classes for k-folds cross-validation (default = 1) |
plot |
whether to plot the output of cvFPM (default = TRUE) |
which |
numeric or character indicating which type of plot to generate (see Details; default = c(1, 2, 3, 4)) |
colors |
values recognizable as colors to be passed to |
colsteps |
integer; number of discrete steps to interpolate colors in |
... |
additional arguments passed to |
Details
cvFPM allows users to "tune" the FN_crit and alpha.test arguments used by FPM (via chemSig). This is achieved by splitting the empirical dataset into "test" and "training" subsets,
calculating benchmarks for the training set, and then evaluating the benchmarks' ability to predict Hits in the out-of-sample test set. The output of cvFPM is similar to optimFPM: optimal FN_crit and alpha.test inputs
based on several classification metrics (see ?optimFPM for more details). The key difference between cvFPM and optimFPM is that cvFPM attempts to account for
out-of-sample uncertainty, whereas optimFPM is specific (and potentially overly specific) to the full FPM dataset. Because the primary use of FPM SQBs will be to predict
toxicity in sediment samples where toxicity is not measured, the FPM should be parameterized in a way that best accounts for out-of-sample uncertainty. In other words, while FPM
generates classification metrics like "overall reliability" for SQBs, the SQBs are unlikely to achieve that reliability when applied to new samples. This is an inherent limitation of SQBs
that the FPM cannot fully address but that cvFPM takes into account.
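The gap between in-sample and out-of-sample performance described above can be illustrated with a minimal base R sketch. Everything here is hypothetical: the toy data, the toy_benchmark helper (a simple percentile of concentrations among training Hits, used as a stand-in and not the FPM algorithm), and the reliability helper (assumed here to mean the fraction of samples classified correctly).

```r
# Hypothetical toy data: one chemical concentration and a logical Hit column
set.seed(1)
n <- 60
toy <- data.frame(conc = rlnorm(n, meanlog = 2, sdlog = 1))
toy$Hit <- toy$conc * exp(rnorm(n, sd = 0.8)) > 10

# Illustrative stand-in for a benchmark (NOT the FPM algorithm): the
# concentration below which a fraction FN_crit of the training Hits fall
toy_benchmark <- function(train, FN_crit) {
  unname(quantile(train$conc[train$Hit], probs = FN_crit))
}

# Assumed definition of reliability: fraction of samples whose Hit status
# is predicted correctly by "concentration exceeds the benchmark"
reliability <- function(d, benchmark) {
  mean((d$conc > benchmark) == d$Hit)
}

# Hold out 15 samples as the test set; train on the remainder
test_idx <- sample(n, 15)
train <- toy[-test_idx, ]
test <- toy[test_idx, ]

# Derive the benchmark from the training set only, then score both subsets
bm <- toy_benchmark(train, FN_crit = 0.2)
c(in_sample = reliability(train, bm), out_of_sample = reliability(test, bm))
```

Only the out-of-sample value reflects how the benchmark behaves on samples it was not derived from, which is the quantity cvFPM is meant to take into account.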
Two cross-validation methods are available, controlled through the k argument. If the user specifies k = NULL or k = nrow(data), then leave-one-out (LOO) is used.
LOO is computationally intensive but is better suited to small datasets.
Other values of k (e.g., the default k = 5) will result in applying a k-folds cross-validation method, which uses larger test subsets (and smaller training sets),
evaluates fewer scenarios, and greatly improves runtime for large datasets. The seed argument can be used to establish a consistent result; if is.null(seed), the result will vary based on randomization.
Allowing for randomization may be desirable to understand between-run variability in cvFPM output caused by re-sampling of training/test sets.
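The relationship between k, seed, and LOO can be sketched in base R as follows. The assign_folds helper is hypothetical and is not part of RFPM; it only illustrates one common way to assign rows to folds and may differ from what cvFPM does internally.

```r
# Hypothetical helper (not part of RFPM): assign each of n rows to a fold
assign_folds <- function(n, k = 5, seed = 1) {
  # LOO: every row is its own fold
  if (is.null(k) || k == n) return(seq_len(n))
  # A fixed seed makes the assignment reproducible; NULL leaves it random
  if (!is.null(seed)) set.seed(seed)
  # Balanced fold labels 1..k, shuffled across the rows
  sample(rep_len(seq_len(k), n))
}

folds <- assign_folds(23, k = 5, seed = 1)
table(folds)  # fold sizes differ by at most one row

# Each fold serves once as the test set; all other rows form the training set
for (i in sort(unique(folds))) {
  test_rows <- which(folds == i)
  train_rows <- which(folds != i)
  # ...derive benchmarks from train_rows, evaluate them on test_rows...
}
```

With k = 5 each training set holds roughly 80% of the rows and only five scenarios are evaluated, whereas LOO evaluates one scenario per row.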
By setting plot = TRUE (the default), the outcome of cross-validation can be visualized over the range of FN_crit values considered. Visualizing the results
can inform the user about variability in the cross-validation process, ranges of potentially reasonable FN_crit values, etc. Graphical output depends on
whether many FN_crit and/or many alpha.test values are evaluated, with line plots or heat plots generated accordingly.
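As a rough illustration of how the colors and colsteps arguments could combine in a heat plot, the following base R sketch interpolates a short color vector into many discrete steps and draws a grid over FN_crit and alpha.test. The metric surface is invented for illustration, and the use of colorRampPalette and image is an assumption, not a statement of how cvFPM draws its plots.

```r
# Grid of tuning values, as might be passed to cvFPM
FN_crit <- seq(0.1, 0.9, by = 0.1)
alpha.test <- seq(0.05, 0.5, by = 0.05)

# Hypothetical metric surface (invented values, not cvFPM output)
metric <- outer(FN_crit, alpha.test,
                function(f, a) 1 - (f - 0.4)^2 - (a - 0.2)^2)

# Interpolate 10 base colors into 100 discrete steps
# (mirrors colors = heat.colors(10) and colsteps = 100)
pal <- colorRampPalette(heat.colors(10))(100)

# Heat plot: one cell per FN_crit/alpha.test combination
image(FN_crit, alpha.test, metric, col = pal,
      xlab = "FN_crit", ylab = "alpha.test")
```

When only one of the two arguments takes multiple values, a line plot of the metric against that argument conveys the same information more directly.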
IMPORTANT: cvFPM is not itself optimized for runtime; running cvFPM can take a long time.
The which argument can be used to specify which of the metric-specific plots should be generated when plot = TRUE. By default,
which = c(1, 2, 3, 4).
Value
data.frame of metric output, base R graphical output
See Also
chemSig, FPM, optimFPM
Examples
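# Chemical concentration columns to evaluate in the h.tristate example data
paramList <- c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
# Arrange the metric plots in a 2 x 2 grid
par(mfrow = c(2, 2))
# Tune FN_crit from 0.1 to 0.9 at a fixed alpha.test of 0.05 (default k = 5)
cvFPM(h.tristate, paramList, FN_crit = seq(0.1, 0.9, 0.1), alpha.test = 0.05)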