R: Lineplot for LLO-adjusted Probability Predictions

lineplot {BRcal}

R Documentation

Lineplot for LLO-adjusted Probability Predictions

Description

Function to visualize how predicted probabilities change under MLE-recalibration and boldness-recalibration.

Usage

lineplot(
  x = NULL,
  y = NULL,
  t_levels = NULL,
  df = NULL,
  Pmc = 0.5,
  event = 1,
  return_df = FALSE,
  epsilon = .Machine$double.eps,
  title = "Line Plot",
  ylab = "Probability",
  xlab = "Posterior Model Probability",
  ylim = c(0, 1),
  breaks = seq(0, 1, by = 0.2),
  thin_to = NULL,
  thin_percent = NULL,
  thin_by = NULL,
  seed = 0,
  optim_options = NULL,
  nloptr_options = NULL,
  ggpoint_options = list(alpha = 0.35, size = 1.5, show.legend = FALSE),
  ggline_options = list(alpha = 0.25, linewidth = 0.5, show.legend = FALSE)
)

Arguments

`x`	a numeric vector of predicted probabilities of an event. Must only contain values in [0,1].
`y`	a vector of outcomes corresponding to probabilities in `x`. Must only contain two unique values (one for "events" and one for "non-events"). By default, this function expects a vector of 0s (non-events) and 1s (events).
`t_levels`	Vector of desired level(s) of calibration at which to boldness-recalibrate and plot the resulting probabilities.
`df`	Dataframe returned by previous call to lineplot() specially formatted for use in this function. Only used for faster plotting when making minor cosmetic changes to a previous call.
`Pmc`	The prior model probability for the calibrated model `M_c`.
`event`	Value in `y` that represents an "event". Default value is 1.
`return_df`	Logical. If `TRUE`, the dataframe used to build this plot will be returned.
`epsilon`	Amount by which probabilities are pushed away from 0 or 1 boundary for numerical stability. If a value in `x` < `epsilon`, it will be replaced with `epsilon`. If a value in `x` > `1-epsilon`, that value will be replaced with `1-epsilon`.
`title`	Plot title.
`ylab`	Label for y-axis.
`xlab`	Label for x-axis.
`ylim`	Vector with bounds for y-axis, must be in [0,1].
`breaks`	Locations along y-axis at which to draw horizontal guidelines, passed to `scale_y_continuous()`.
`thin_to`	When non-null, the observations in (x,y) are randomly sampled without replacement to form a set of size `thin_to`.
`thin_percent`	When non-null, the observations in (x,y) are randomly sampled without replacement to form a set that is `thin_percent` * 100% of the original size of (x,y).
`thin_by`	When non-null, the observations in (x,y) are thinned by selecting every `thin_by`-th observation.
`seed`	Seed for random thinning. Set to NULL for no seed.
`optim_options`	List of additional arguments to be passed to optim().
`nloptr_options`	List with options to be passed to `nloptr()`.
`ggpoint_options`	List with options to be passed to `geom_point()`.
`ggline_options`	List with options to be passed to `geom_line()`.
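The boundary adjustment described for `epsilon` can be sketched in base R. This is only an illustration of the documented rule; `clamp_probs` is a hypothetical helper name, not a function exported by BRcal, and the package's internal implementation may differ.

```r
# Hypothetical helper (not part of BRcal): push probabilities away from 0 and 1.
# Values below epsilon become epsilon; values above 1 - epsilon become 1 - epsilon.
clamp_probs <- function(x, epsilon = .Machine$double.eps) {
  pmin(pmax(x, epsilon), 1 - epsilon)
}

x <- c(0, 0.25, 1)
x_clamped <- clamp_probs(x)

# Interior values are untouched; boundary values move strictly inside (0, 1)
print(x_clamped)
```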

Details

This function leverages ggplot() and related functions from the ggplot2 package (Wickham, 2016).

The goal of this function is to visualize how predicted probabilities change under different recalibration parameters. By default this function only shows how the original probabilities change after MLE recalibration. Argument t_levels can be used to specify a vector of levels of boldness-recalibration to visualize in addition to MLE recalibration.
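To make "MLE recalibration" concrete, the following base R sketch applies the two-parameter linear in log odds (LLO) adjustment and fits its parameters by maximum likelihood with optim(). It is a minimal illustration under assumptions: the helper names `llo` and `neg_loglik`, the log-scale parameterization of `delta`, and the starting values are choices made here, not the package's internals.

```r
# LLO adjustment: logit(adjusted) = log(delta) + gamma * logit(x).
# Hypothetical helper written for this sketch, not BRcal's own function.
llo <- function(x, delta, gamma) {
  plogis(log(delta) + gamma * qlogis(x))
}

# Negative Bernoulli log-likelihood of outcomes y under LLO-adjusted x.
# par = c(log(delta), gamma); the log scale keeps delta positive.
neg_loglik <- function(par, x, y, eps = 1e-12) {
  p <- llo(x, exp(par[1]), par[2])
  p <- pmin(pmax(p, eps), 1 - eps)  # guard against log(0)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

set.seed(28)
x <- runif(100)
y <- rbinom(100, 1, x)

# Start at delta = 1, gamma = 1, which leaves x unchanged
fit <- optim(c(0, 1), neg_loglik, x = x, y = y)
delta_hat <- exp(fit$par[1])
gamma_hat <- fit$par[2]

# MLE-recalibrated probabilities
x_mle <- llo(x, delta_hat, gamma_hat)
```

Each observation keeps its position in `x_mle`, which is what allows a line to be drawn from an original probability to its recalibrated value.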

While the x-axis shows the posterior model probabilities of each set of probabilities, note the posterior model probabilities are not in ascending or descending order. Instead, they simply follow the ordering of how one might use the BRcal package: first looking at the original predictions, then maximizing calibration, then examining how far they can spread out predictions while maintaining calibration with boldness-recalibration.

Value

If return_df = TRUE, a list with the following elements is returned:

`plot`	A `ggplot` object showing how the predicted probabilities change under MLE recalibration and specified levels of boldness-recalibration.
`df`	Dataframe used to create `plot`, specially formatted for use in `lineplot()`.

Otherwise just the ggplot object of the plot is returned.

Reusing underlying dataframe via `return_df`

This function is usually fast at moderate sample sizes, but it still calls optim() once for MLE recalibration and nloptr() once for each level of boldness-recalibration, which can become a bottleneck. To avoid repeating this work, specify return_df=TRUE to return the underlying dataframe used to build the lineplot. Passing this dataframe to df in subsequent calls to lineplot() skips the calls to optim() and nloptr(), which is useful when making only cosmetic changes to the plot.

When return_df=TRUE, both the plot and the dataframe are returned in a list. The dataframe contains 6 columns:

probs: the values of each predicted probability under each set
outcome: the corresponding outcome for each predicted probability
post: the posterior model probability of the set as a whole
id: the id of each individual probability used for mapping observations between sets
set: the set to which the probability belongs
label: the label used for the x-axis in the lineplot

Essentially, each set of probabilities (original, MLE-recalibrated, and each level of boldness-recalibration) and outcomes are "stacked" on top of each other. The id tells the plotting function how to connect (with a line) the same observation as it changes from the original set to MLE- or boldness-recalibration.
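The stacked layout can be illustrated with a small base R sketch. It builds only the `probs`, `outcome`, `id`, and `set` columns; `post` and `label` are omitted, and the adjustment parameters are arbitrary values chosen for illustration rather than fitted ones.

```r
# Hypothetical LLO helper for this sketch (not BRcal's own function)
llo <- function(x, delta, gamma) {
  plogis(log(delta) + gamma * qlogis(x))
}

set.seed(28)
x <- runif(5)
y <- rbinom(5, 1, x)

# Arbitrary illustrative parameters, not MLEs
x_adj <- llo(x, delta = 1.2, gamma = 1.5)

# Stack the two sets; id ties each observation to itself across sets
stacked <- rbind(
  data.frame(probs = x,     outcome = y, id = seq_along(x), set = "Original"),
  data.frame(probs = x_adj, outcome = y, id = seq_along(x), set = "Adjusted")
)

# Every id appears exactly once in each set
print(table(stacked$id, stacked$set))
```

Grouping by `id` when drawing lines is what connects an observation's original probability to its value in each later set.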

Thinning

Another strategy to save time when plotting is to thin the amount of data plotted. When sample sizes are large, the plot can become overcrowded and slow to plot. We provide three options for thinning: thin_to, thin_percent, and thin_by. By default, all three of these settings are set to NULL, meaning no thinning is performed. Users can only specify one thinning strategy at a time. Care should be taken in selecting a thinning approach based on the nature of your data and problem. Note that MLE recalibration and boldness-recalibration will be done using the full set.
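The three thinning strategies can be sketched as index selections in base R. The details here are assumptions made for illustration: rounding `thin_percent * n` to the nearest integer and starting `thin_by` at the first observation are not confirmed by this page.

```r
n <- 100  # number of observations in (x, y)

# thin_to = 75: random sample of 75 observations without replacement
set.seed(0)
idx_to <- sort(sample(n, size = 75))

# thin_percent = 0.53: random sample of 53% of the observations
# (rounding is an assumption of this sketch)
set.seed(0)
idx_percent <- sort(sample(n, size = round(0.53 * n)))

# thin_by = 3: every 3rd observation, starting from the first
# (the starting position is an assumption of this sketch)
idx_by <- seq(1, n, by = 3)

print(c(length(idx_to), length(idx_percent), length(idx_by)))
```

The random strategies depend on the seed, while `thin_by` is deterministic and depends on the ordering of the data, which is one reason to consider how the data are sorted before choosing it.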

Passing additional arguments to `geom_point()` and `geom_line()`

To make cosmetic changes to the points and lines plotted, users can pass a list of any desired arguments of geom_point() and geom_line() to ggpoint_options and ggline_options, respectively. These will overwrite everything passed to geom_point() or geom_line() except any aesthetic arguments in aes().
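In R, supplying a value for an argument replaces its default as a whole, so a list passed to ggpoint_options stands in for the entire default list shown in Usage. Whether lineplot() merges user options with its defaults internally is not stated here, so one cautious approach is to build the full list yourself. The sketch below uses utils::modifyList() for that; the variable names are illustrative.

```r
# Defaults for ggpoint_options as listed in the Usage section
point_defaults <- list(alpha = 0.35, size = 1.5, show.legend = FALSE)

# Override size and add shape while keeping the other defaults
point_opts <- utils::modifyList(point_defaults, list(size = 3, shape = 4))

str(point_opts)
```

The resulting list could then be passed as `ggpoint_options = point_opts`; the same pattern applies to `ggline_options`.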

References

Guthrie, A. P., and Franck, C. T. (2024) Boldness-Recalibration for Binary Event Predictions, The American Statistician 1-17.

Wickham, H. (2016) ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.

Examples


set.seed(28)
# Simulate 100 predicted probabilities
x <- runif(100)
# Simulate 100 binary event outcomes using x
y <- rbinom(100, 1, x)  # By construction, x is well calibrated.

# Lineplot shows the change in probabilities from original to MLE-recalibration to
# specified levels of boldness-recalibration via t_levels
# Return a list with the dataframe used to construct the plot with return_df=TRUE
lp1 <- lineplot(x, y, t_levels=c(0.98, 0.95), return_df=TRUE)
lp1$plot

# Reusing the previous dataframe to save calculation time
lineplot(df=lp1$df)

# Adjust geom_point cosmetics via ggpoint_options
# Increase point size and change shape to crosses (shape 4)
lineplot(df=lp1$df, ggpoint_options=list(size=3, shape=4))

# Adjust geom_line cosmetics via ggline_options
# Increase line width and change transparency
lineplot(df=lp1$df, ggline_options=list(linewidth=2, alpha=0.1))

# Thinning down to 75 randomly selected observations
lineplot(df=lp1$df, thin_to=75)

# Thinning down to 53% of the data
lineplot(df=lp1$df, thin_percent=0.53)

# Thinning down to every 3rd observation
lineplot(df=lp1$df, thin_by=3)

# Setting a different seed for thinning
lineplot(df=lp1$df, thin_percent=0.53, seed=47)

# Setting NO seed for thinning (plot will be different every time)
lineplot(df=lp1$df, thin_to=75, seed=NULL)

[Package BRcal version 0.0.4 Index]