cvFPM {RFPM}    R Documentation

Cross-Validation Optimization of the Floating Percentile Model

Description

Use leave-one-out (LOO) or k-folds cross-validation to calculate the FN_crit and alpha.test inputs that optimize benchmark performance while attempting to account for out-of-sample uncertainty.

Usage

cvFPM(
  data,
  paramList,
  FN_crit = seq(0.1, 0.9, by = 0.05),
  alpha.test = seq(0.05, 0.5, by = 0.05),
  k = 5,
  seed = 1,
  plot = TRUE,
  which = c(1, 2, 3, 4),
  colors = heat.colors(10),
  colsteps = 100,
  ...
)

Arguments

data

data.frame containing, at a minimum, chemical concentrations as columns and a logical Hit column classifying toxicity

paramList

character vector of column names of chemical concentration variables in data

FN_crit

numeric vector of values between 0 and 1 indicating false negative threshold for floating percentile model benchmark selection (default = seq(0.1, 0.9, 0.05))

alpha.test

numeric vector of values between 0 and 1 indicating type-I error rate for chemical selection (default = seq(0.05, 0.5, by = 0.05))

k

integer with length = 1 and value > 1 indicating how many folds to include in k-folds cross-validation; k = NULL or k = nrow(data) selects leave-one-out instead (see Details; default = 5)

seed

integer with length = 1 indicating the random seed to set when assigning rows to the k folds for k-folds cross-validation; if NULL, the assignment varies between runs (default = 1)

plot

logical; whether to plot the output of cvFPM (default = TRUE)

which

numeric or character indicating which type of plot to generate (see Details; default = c(1, 2, 3, 4))

colors

values recognizable as colors to be passed to colorRampPalette (via colorGradient) to generate a palette for plotting (default = heat.colors(10))

colsteps

integer; number of discrete steps to interpolate colors in colorGradient (default = 100)

...

additional arguments passed to chemSig and FPM
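
As a rough illustration of how the colors and colsteps arguments relate, the sketch below calls base R's colorRampPalette directly. It assumes that colorGradient interpolates in a comparable way, which this page does not spell out; the variable names are illustrative only.

```r
# Interpolate the default 10-color heat palette into 100 steps,
# mirroring colors = heat.colors(10) and colsteps = 100.
# Assumption: colorGradient behaves similarly; this is not its source code.
base_cols <- heat.colors(10)
ramp <- colorRampPalette(base_cols)   # returns a function of n
palette_100 <- ramp(100)              # character vector of hex colors

length(palette_100)
```

Supplying a different vector for colors (for example, c("blue", "white", "red")) would change the endpoints of the gradient, while colsteps controls how finely it is divided.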

Details

cvFPM allows users to "tune" the FN_crit and alpha.test arguments used by FPM (via chemSig). This is achieved by splitting the empirical dataset into "test" and "training" subsets, calculating benchmarks for the training set, and then evaluating the benchmarks' ability to predict Hits in the out-of-sample test set. The output of cvFPM is similar to optimFPM: optimal FN_crit and alpha.test inputs based on several classification metrics (see ?optimFPM for more details). The key difference between cvFPM and optimFPM is that cvFPM attempts to account for out-of-sample uncertainty, whereas optimFPM is specific (and potentially overly specific) to the full FPM dataset. Because the primary use of FPM SQBs will be to predict toxicity in sediment samples where toxicity is not measured, the FPM should be parameterized in a way that best accounts for out-of-sample uncertainty. In other words, while FPM generates classification metrics like "overall reliability" for SQBs, they are unlikely to achieve the expected reliability when applied to new samples. This is an inherent limitation of SQBs, which the FPM cannot fully address but that cvFPM considers.
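
The training/test idea described above can be sketched in base R without RFPM. The example below is only an illustration under stated assumptions: the toy data, the toy_benchmark function (a simple percentile of non-hit concentrations), and the false negative rate calculation are hypothetical stand-ins and are not the FPM algorithm or the cvFPM implementation.

```r
# Toy data: one hypothetical chemical ("Cu") and a logical Hit column.
set.seed(1)
toy <- data.frame(Cu  = c(rlnorm(30, 2), rlnorm(30, 3)),
                  Hit = rep(c(FALSE, TRUE), each = 30))

# Hypothetical stand-in for a benchmark calculation (NOT the FPM algorithm):
# the 90th percentile of concentrations among non-hit training samples.
toy_benchmark <- function(train) {
  unname(quantile(train$Cu[!train$Hit], probs = 0.9))
}

# Assign each row to one of k folds at random.
k <- 5
fold <- sample(rep_len(seq_len(k), nrow(toy)))

# For each fold: derive the benchmark from the training rows, classify the
# held-out rows, and record the out-of-sample false negative rate.
fn_rate <- vapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  test  <- toy[fold == i, ]
  pred_hit <- test$Cu > toy_benchmark(train)
  # False negatives: true hits predicted as non-hits
  sum(test$Hit & !pred_hit) / max(sum(test$Hit), 1)
}, numeric(1))

mean(fn_rate)
```

Comparing this out-of-sample rate with the rate obtained by fitting and evaluating on the full dataset gives a sense of why in-sample metrics can be optimistic.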

Two cross-validation methods are available, controlled through the k argument. If the user specifies k = NULL or k = nrow(data), then leave-one-out (LOO) is used. LOO is computationally intensive but is better suited to small datasets. Other values of k (e.g., the default k = 5) result in a k-folds cross-validation method, which uses larger test subsets (and smaller training sets), evaluates fewer scenarios, and greatly improves runtime for large datasets. The seed argument can be used to obtain a reproducible result; if is.null(seed), the result will vary based on randomization. Allowing for randomization may be desirable to understand between-run variability in cvFPM output caused by re-sampling of training/test sets.
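
One way to picture how k and seed could translate into fold labels is sketched below. The assign_folds helper is hypothetical and is not part of RFPM; it only mirrors the behavior described above (LOO when k is NULL or equals the number of rows, seeded shuffling otherwise).

```r
# Hypothetical helper (not part of RFPM): map k and seed to fold labels.
assign_folds <- function(n, k = 5, seed = 1) {
  # k = NULL or k = n: leave-one-out, every row is its own test set
  if (is.null(k) || k == n) return(seq_len(n))
  # Seed only when one is supplied; otherwise leave the RNG state alone
  if (!is.null(seed)) set.seed(seed)
  # Balanced labels 1..k, shuffled across rows
  sample(rep_len(seq_len(k), n))
}

folds_a <- assign_folds(20, k = 5, seed = 1)
folds_b <- assign_folds(20, k = 5, seed = 1)
identical(folds_a, folds_b)                   # same seed, same assignment

table(folds_a)                                # 4 rows per fold
length(unique(assign_folds(20, k = NULL)))    # 20 folds under LOO
```

Calling the helper with seed = NULL several times would give different assignments, which is the source of the between-run variability mentioned above.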

By setting plot = TRUE (the default), the outcome of cross-validation can be visualized over the range of FN_crit values considered. Visualizing the results can inform the user about variability in the cross-validation process, ranges of potentially reasonable FN_crit values, etc. Graphical output depends on whether multiple FN_crit and/or multiple alpha.test values are evaluated, with line plots or heat plots generated accordingly.

IMPORTANT: cvFPM is not itself optimized for runtime, so running it can take a long time.

The which argument can be used to specify which of the metric-specific plots should be generated when plot = TRUE. Inputs to which are, by default, c(1, 2, 3, 4).

Value

data.frame of metric output; base R graphical output when plot = TRUE

See Also

chemSig, FPM, optimFPM

Examples

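# Chemical concentration columns in h.tristate to evaluate
paramList <- c("Cd", "Cu", "Fe", "Mn", "Ni", "Pb", "Zn")
# 2 x 2 layout for the default which = c(1, 2, 3, 4) plots
par(mfrow = c(2, 2))
cvFPM(h.tristate, paramList, FN_crit = seq(0.1, 0.9, 0.1), alpha.test = 0.05)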

[Package RFPM version 1.1 Index]