create_p_funs {ale}R Documentation

Create a p-value functions object that can be used to generate p-values

Description

Calculating p-values is not trivial for ALE statistics because ALE is non-parametric and model-agnostic. Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the ale package generates p-values by calculating ALE for many random variables; this makes the procedure somewhat slow. For this reason, they are not calculated by default; they must be explicitly requested. Because the ale package is model-agnostic (that is, it works with any kind of R model), the ale() function cannot always automatically manipulate the model object to create the p-values. It can only do so for models that follow the standard R statistical modelling conventions, which includes almost all built-in R algorithms (like stats::lm() and stats::glm()) and many widely used statistics packages (like mgcv and survival), but which excludes most machine learning algorithms (like tidymodels and caret). For non-standard algorithms, the user needs to do a little work to help the ale function correctly manipulate its model object:

See the example below for how this is implemented.

Usage

create_p_funs(
  data,
  model,
  ...,
  parallel = parallel::detectCores(logical = FALSE) - 1,
  model_packages = as.character(NA),
  random_model_call_string = NULL,
  random_model_call_string_vars = character(),
  y_col = NULL,
  pred_fun = function(object, newdata, type = pred_type) {
     stats::predict(object =
    object, newdata = newdata, type = type)
 },
  pred_type = "response",
  rand_it = 1000,
  silent = FALSE,
  .testing_mode = FALSE
)

Arguments

data

See documentation for ale()

model

See documentation for ale()

...

not used. Inserted to require explicit naming of subsequent arguments.

parallel

See documentation for ale()

model_packages

See documentation for ale()

random_model_call_string

character string. If NULL, create_p_funs() tries to automatically detect and construct the call for p-values. If it cannot, the function will fail early. In that case, a character string of the full call for the model must be provided that includes the random variable. See details.

random_model_call_string_vars

See documentation for model_call_string_vars in model_bootstrap(); the operation is very similar.

y_col

See documentation for ale()

pred_fun, pred_type

See documentation for ale().

rand_it

non-negative integer length 1. Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values. It can be reduced as low as 100 for faster test runs.

silent

See documentation for ale()

.testing_mode

logical. Internal use only.

Value

The return value is a list of class ⁠c('p_funs', 'ale', 'list'⁠) with an ale_version attribute whose value is the version of the ale package used to create the object. See examples for an illustration of how to inspect this list. Its elements are:

Approach to calculating p-values

The ale package takes a literal frequentist approach to the calculation of p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the rand_it argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random variable ALE statistics that exceed the value of ALE statistic of the actual variable in question. The specific steps are as follows:

References

Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.

Examples


# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Create p-value functions
pf_diamonds <- create_p_funs(
  diamonds_sample,
  gam_diamonds,
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100,
)

# Examine the structure of the returned object
str(pf_diamonds)
# In RStudio: View(pf_diamonds)

# Calculate ALEs with p-values
ale_gam_diamonds <- ale(
  diamonds_sample,
  gam_diamonds,
  p_values = pf_diamonds
)

# Plot the ALE data. The horizontal bands in the plots use the p-values.
ale_gam_diamonds$plots |>
  patchwork::wrap_plots()


# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pf_diamonds <- create_p_funs(
  diamonds_sample,
  gam_diamonds,
  random_model_call_string = 'mgcv::gam(
    price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
        cut + color + clarity + random_variable,
    data = rand_data
  )',
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100,
)





[Package ale version 0.3.0 Index]