cutpointr {cutpointr}R Documentation

Determine and evaluate optimal cutpoints

Description

Using predictions (or e.g. biological marker values) and binary class labels, this function will determine "optimal" cutpoints using various selectable methods. The methods for cutpoint determination can be evaluated using bootstrapping. An estimate of the cutpoint variability and the out-of-sample performance can then be returned with summary or plot. For an introduction to the package please see vignette("cutpointr", package = "cutpointr").

Usage

cutpointr(...)

## Default S3 method:
cutpointr(
  data,
  x,
  class,
  subgroup = NULL,
  method = maximize_metric,
  metric = sum_sens_spec,
  pos_class = NULL,
  neg_class = NULL,
  direction = NULL,
  boot_runs = 0,
  boot_stratify = FALSE,
  use_midpoints = FALSE,
  break_ties = median,
  na.rm = FALSE,
  allowParallel = FALSE,
  silent = FALSE,
  tol_metric = 1e-06,
  ...
)

## S3 method for class 'numeric'
cutpointr(
  x,
  class,
  subgroup = NULL,
  method = maximize_metric,
  metric = sum_sens_spec,
  pos_class = NULL,
  neg_class = NULL,
  direction = NULL,
  boot_runs = 0,
  boot_stratify = FALSE,
  use_midpoints = FALSE,
  break_ties = median,
  na.rm = FALSE,
  allowParallel = FALSE,
  silent = FALSE,
  tol_metric = 1e-06,
  ...
)

Arguments

...

Further optional arguments that will be passed to method. minimize_metric and maximize_metric pass ... to metric.

data

A data.frame with the data needed for x, class and optionally subgroup.

x

The variable name to be used for classification, e.g. predictions. The raw vector of values if the data argument is unused.

class

The variable name indicating class membership. If the data argument is unused, the vector of raw numeric values.

subgroup

An additional covariate that identifies subgroups or the raw data if data = NULL. Separate optimal cutpoints will be determined per group. Numeric, character and factor are allowed.

method

(function) A function for determining cutpoints. Can be user supplied or use some of the built in methods. See details.

metric

(function) The function for computing a metric when using maximize_metric or minimize_metric as method and and for the out-of-bag values during bootstrapping. A way of internally validating the performance. User defined functions can be supplied, see details.

pos_class

(optional) The value of class that indicates the positive class.

neg_class

(optional) The value of class that indicates the negative class.

direction

(character, optional) Use ">=" or "<=" to indicate whether x is supposed to be larger or smaller for the positive class.

boot_runs

(numerical) If positive, this number of bootstrap samples will be used to assess the variability and the out-of-sample performance.

boot_stratify

(logical) If the bootstrap is stratified, bootstrap samples are drawn separately in both classes and then combined, keeping the proportion of positives and negatives constant in every resample.

use_midpoints

(logical) If TRUE (default FALSE) the returned optimal cutpoint will be the mean of the optimal cutpoint and the next highest observation (for direction = ">=") or the next lowest observation (for direction = "<=") which avoids biasing the optimal cutpoint.

break_ties

If multiple cutpoints are found, they can be summarized using this function, e.g. mean or median. To return all cutpoints use c as the function.

na.rm

(logical) Set to TRUE (default FALSE) to keep only complete cases of x, class and subgroup (if specified). Missing values with na.rm = FALSE will raise an error.

allowParallel

(logical) If TRUE, the bootstrapping will be parallelized using foreach. A local cluster, for example, should be started manually beforehand.

silent

(logical) If TRUE suppresses all messages.

tol_metric

All cutpoints will be returned that lead to a metric value in the interval [m_max - tol_metric, m_max + tol_metric] where m_max is the maximum achievable metric value. This can be used to return multiple decent cutpoints and to avoid floating-point problems. Not supported by all method functions, see details.

Details

If direction and/or pos_class and neg_class are not given, the function will assume that higher values indicate the positive class and use the class with a higher median as the positive class.

This function uses tidyeval to support unquoted arguments. For programming with cutpointr the operator !! can be used to unquote an argument, see the examples.

Different methods can be selected for determining the optimal cutpoint via the method argument. The package includes the following method functions:

User-defined functions can be supplied to method, too. As a reference, the code of all included method functions can be accessed by simply typing their name. To define a new method function, create a function that may take as input(s):

The ... argument can be used to avoid an error if not all of the above arguments are needed or in order to pass additional arguments to method. The function should return a data.frame or tbl_df with one row, the column "optimal_cutpoint", and an optional column with an arbitrary name with the metric value at the optimal cutpoint.

Built-in metric functions include:

Furthermore, the following functions are included which can be used as metric functions but are more useful for plotting purposes, for example in plot_cutpointr, or for defining new metric functions: tp, fp, tn, fn, tpr, fpr, tnr, fnr, false_omission_rate, false_discovery_rate, ppv, npv, precision, recall, sensitivity, and specificity.

User defined metric functions can be created as well which can accept the following inputs as vectors:

The function should return a numeric vector or a matrix or a data.frame with one column. If the column is named, the name will be included in the output and plots. Avoid using names that are identical to the column names that are by default returned by cutpointr.

If boot_runs is positive, that number of bootstrap samples will be drawn and the optimal cutpoint using method will be determined. Additionally, as a way of internal validation, the function in metric will be used to score the out-of-bag predictions using the cutpoints determined by method. Various default metrics are always included in the bootstrap results.

If multiple optimal cutpoints are found, the column optimal_cutpoint becomes a list that contains the vector(s) of the optimal cutpoints.

If use_midpoints = TRUE the mean of the optimal cutpoint and the next highest or lowest possible cutpoint is returned, depending on direction.

The tol_metric argument can be used to avoid floating-point problems that may lead to exclusion of cutpoints that achieve the optimally achievable metric value. Additionally, by selecting a large tolerance multiple cutpoints can be returned that lead to decent metric values in the vicinity of the optimal metric value. tol_metric is passed to metric and is only supported by the maximization and minimization functions, i.e. maximize_metric, minimize_metric, maximize_loess_metric, minimize_loess_metric, maximize_spline_metric, and minimize_spline_metric. In maximize_boot_metric and minimize_boot_metric multiple optimal cutpoints will be passed to the summary_func of these two functions.

Value

A cutpointr object which is also a data.frame and tbl_df.

See Also

Other main cutpointr functions: add_metric(), boot_ci(), boot_test(), multi_cutpointr(), predict.cutpointr(), roc()

Examples

library(cutpointr)

## Optimal cutpoint for dsi
data(suicide)
opt_cut <- cutpointr(suicide, dsi, suicide)
opt_cut
s_opt_cut <- summary(opt_cut)
plot(opt_cut)

## Not run: 
## Predict class for new observations
predict(opt_cut, newdata = data.frame(dsi = 0:5))

## Supplying raw data, same result
cutpointr(x = suicide$dsi, class = suicide$suicide)

## direction, class labels, method and metric can be defined manually
## Again, same result
cutpointr(suicide, dsi, suicide, direction = ">=", pos_class = "yes",
          method = maximize_metric, metric = youden)

## Optimal cutpoint for dsi, as before, but for the separate subgroups
opt_cut <- cutpointr(suicide, dsi, suicide, gender)
opt_cut
(s_opt_cut <- summary(opt_cut))
tibble:::print.tbl(s_opt_cut)

## Bootstrapping also works on individual subgroups
set.seed(30)
opt_cut <- cutpointr(suicide, dsi, suicide, gender, boot_runs = 1000,
  boot_stratify = TRUE)
opt_cut
summary(opt_cut)
plot(opt_cut)

## Parallelized bootstrapping
  library(doParallel)
  library(doRNG)
  cl <- makeCluster(2) # 2 cores
  registerDoParallel(cl)
  registerDoRNG(12) # Reproducible parallel loops using doRNG
  opt_cut <- cutpointr(suicide, dsi, suicide, gender,
                       boot_runs = 1000, allowParallel = TRUE)
  stopCluster(cl)
  opt_cut
  plot(opt_cut)

## Robust cutpoint method using kernel smoothing for optimizing Youden-Index
opt_cut <- cutpointr(suicide, dsi, suicide, gender,
                     method = oc_youden_kernel)
opt_cut

## End(Not run)




[Package cutpointr version 1.1.2 Index]