explain {shapr} | R Documentation
Explain the output of machine learning models with more accurately estimated Shapley values
Description
Explain the output of machine learning models with more accurately estimated Shapley values
Usage
explain(x, explainer, approach, prediction_zero, ...)

## S3 method for class 'empirical'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  type = "fixed_sigma",
  fixed_sigma_vec = 0.1,
  n_samples_aicc = 1000,
  eval_max_aicc = 20,
  start_aicc = 0.1,
  w_threshold = 0.95,
  ...
)

## S3 method for class 'gaussian'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  mu = NULL,
  cov_mat = NULL,
  ...
)

## S3 method for class 'copula'
explain(x, explainer, approach, prediction_zero, ...)

## S3 method for class 'ctree'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  mincriterion = 0.95,
  minsplit = 20,
  minbucket = 7,
  sample = TRUE,
  ...
)

## S3 method for class 'combined'
explain(
  x,
  explainer,
  approach,
  prediction_zero,
  mu = NULL,
  cov_mat = NULL,
  ...
)

## S3 method for class 'ctree_comb_mincrit'
explain(x, explainer, approach, prediction_zero, mincriterion, ...)
Arguments
x: A matrix or data.frame. Contains the features whose predictions ought to
  be explained (test data).

explainer: An explainer object to use for explaining the observations.
  See shapr().

approach: Character vector of length 1 or equal to the number of features
  in the model. All elements should be one of "empirical", "gaussian",
  "copula" or "ctree". See Details for more information.

prediction_zero: Numeric. The prediction value for unseen data, typically
  equal to the mean of the response.

...: Additional arguments passed to prepare_data().

type: Character. Should be equal to either "independence", "fixed_sigma",
  "AICc_each_k" or "AICc_full".

fixed_sigma_vec: Numeric. Represents the kernel bandwidth. Note that this
  argument is only applicable when type = "fixed_sigma".

n_samples_aicc: Positive integer. Number of samples to consider in the AICc
  optimization. Note that this argument is only applicable when type equals
  "AICc_each_k" or "AICc_full".

eval_max_aicc: Positive integer. Maximum number of iterations when
  optimizing the AICc. Note that this argument is only applicable when type
  equals "AICc_each_k" or "AICc_full".

start_aicc: Numeric. Start value of the sigma parameter when optimizing the
  AICc. Note that this argument is only applicable when type equals
  "AICc_each_k" or "AICc_full".

w_threshold: Numeric between 0 and 1.

mu: Numeric vector. (Optional) Containing the mean of the data generating
  distribution. If NULL, the expected values are estimated from the data.
  Note that this is only used when approach = "gaussian".

cov_mat: Numeric matrix. (Optional) Containing the covariance matrix of the
  data generating distribution. If NULL, the covariance matrix is estimated
  from the data. Note that this is only used when approach = "gaussian".

mincriterion: Numeric scalar or vector of length equal to the number of
  features in the model. Equal to 1 - alpha, where alpha is the nominal
  level of the conditional independence tests. If a vector, it indicates
  which mincriterion to use when conditioning on various numbers of
  features.

minsplit: Numeric. The value that the sum of weights in the left and right
  daughter nodes must exceed for a split to be made.

minbucket: Numeric. The minimum sum of weights in a terminal node.

sample: Boolean. If TRUE, the method always samples n_samples observations
  from the leaf (with replacement). If FALSE and the number of observations
  in the leaf is less than n_samples, the method takes all observations in
  the leaf.
Details
The most important thing to notice is that shapr has implemented four
different approaches for estimating the conditional distributions of the
data, namely "empirical", "gaussian", "copula" and "ctree".

In addition, the user has the option of combining these four approaches.
E.g. if you have trained a model that consists of 10 features, and you'd
like to use the "gaussian" approach when you condition on a single feature,
the "empirical" approach when you condition on 2-5 features, and the
"copula" approach when you condition on more than 5 features, this can be
done by simply passing approach = c("gaussian", rep("empirical", 4),
rep("copula", 5)). If approach[i] = "gaussian", it means that the
"gaussian" approach is used when conditioning on i features.
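A minimal sketch of how such a combined approach vector can be constructed
(the 10-feature setup mirrors the example above; the split points are
illustrative, not package defaults):

n_features <- 10
approach <- c(
  "gaussian",           # conditioning on 1 feature
  rep("empirical", 4),  # conditioning on 2-5 features
  rep("copula", 5)      # conditioning on 6-10 features
)
stopifnot(length(approach) == n_features)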
Value
Object of class c("shapr", "list"). Contains the following items:

- dt: data.table
- model: Model object
- p: Numeric vector
- x_test: data.table
Note that the returned items model, p and x_test are mostly added due to
the implementation of plot.shapr. If you only want to look at the numerical
results, it is sufficient to focus on dt. dt is a data.table where the
number of rows equals the number of observations you'd like to explain, and
the number of columns equals m + 1, where m equals the total number of
features in your model.

If dt[i, j + 1] > 0, it indicates that the j-th feature increased the
prediction for the i-th observation. Likewise, if dt[i, j + 1] < 0, it
indicates that the j-th feature decreased the prediction for the i-th
observation. The magnitude of the value is also important to notice. E.g.
if dt[i, k + 1] and dt[i, j + 1] are both greater than 0, where j != k, and
dt[i, k + 1] > dt[i, j + 1], this indicates that features j and k both
increased the value of the prediction, but that the effect of the k-th
feature was larger than that of the j-th feature.

The first column in dt, called 'none', is the prediction value not assigned
to any of the features (phi_0). It's equal for all observations and set by
the user through the argument prediction_zero. In theory this value should
be the expected prediction without conditioning on any features. Typically
we set this value equal to the mean of the response variable in our
training data, but other choices, such as the mean of the predictions in
the training data, are also reasonable.
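A minimal sketch of how dt can be read (the data.table below is fabricated
for illustration, not package output; the feature names are hypothetical):

library(data.table)
# Fabricated Shapley values for 2 observations and m = 3 features
dt <- data.table(
  none  = c(22.5, 22.5),   # phi_0: identical for all observations
  lstat = c(1.20, -0.80),  # > 0 increased, < 0 decreased the prediction
  rm    = c(-0.30, 0.50),
  dis   = c(0.10, 0.00)
)
# Each row sums to the model's prediction for that observation
rowSums(dt)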
Author(s)
Camilla Lingjaerde, Nikolai Sellereite, Martin Jullum, Annabelle Redelmeier
Examples
if (requireNamespace("MASS", quietly = TRUE)) {

  # Load example data
  data("Boston", package = "MASS")

  # Split data into test- and training data
  x_train <- head(Boston, -3)
  x_test <- tail(Boston, 3)

  # Fit a linear model
  model <- lm(medv ~ lstat + rm + dis + indus, data = x_train)

  # Create an explainer object
  explainer <- shapr(x_train, model)

  # Explain predictions
  p <- mean(x_train$medv)

  # Empirical approach
  explain1 <- explain(x_test, explainer,
    approach = "empirical",
    prediction_zero = p, n_samples = 1e2
  )
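  # A sketch of the empirical approach with an AICc-optimized bandwidth.
  # Assumes type = "AICc_each_k" is available in your shapr version; this
  # can be slow, hence it is left commented out.
  # explain1b <- explain(x_test, explainer,
  #   approach = "empirical", type = "AICc_each_k",
  #   prediction_zero = p, n_samples = 1e2
  # )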
  # Gaussian approach
  explain2 <- explain(x_test, explainer,
    approach = "gaussian",
    prediction_zero = p, n_samples = 1e2
  )

  # Gaussian copula approach
  explain3 <- explain(x_test, explainer,
    approach = "copula",
    prediction_zero = p, n_samples = 1e2
  )

  # ctree approach
  explain4 <- explain(x_test, explainer,
    approach = "ctree",
    prediction_zero = p
  )
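  # A sketch of the ctree approach with one mincriterion per number of
  # conditioned-on features (the model above has 4 features; the values
  # below are illustrative, hence left commented out).
  # explain4b <- explain(x_test, explainer,
  #   approach = "ctree",
  #   mincriterion = c(0.25, 0.25, 0.95, 0.95),
  #   prediction_zero = p
  # )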
  # Combined approach
  approach <- c("gaussian", "gaussian", "empirical", "empirical")
  explain5 <- explain(x_test, explainer,
    approach = approach,
    prediction_zero = p, n_samples = 1e2
  )

  # Print the Shapley values
  print(explain1$dt)

  # Plot the results
  if (requireNamespace("ggplot2", quietly = TRUE)) {
    plot(explain1)
  }
}