sobolshap_knn {sensitivity}    R Documentation

Flexible sensitivity analysis via ranking / nearest neighbours

Description

WARNING: DEPRECATED function: use shapleysobol_knn instead. sobolshap_knn implements the estimation of several sensitivity indices using only N model evaluations via ranking (following Gamboa et al. (2020) and Chatterjee (2019)) or nearest neighbour search (Broto et al. (2020) and Azadkia & Chatterjee (2020)). It can be used with categorical inputs (which are transformed with one-hot encoding), dependent inputs and multiple outputs. Sensitivity indices of any group of inputs can be computed, which means that in particular first-order/total Sobol indices and Shapley effects are accessible. For large sample sizes, the nearest neighbour algorithm can be significantly accelerated by using approximate nearest neighbour search. It is also possible to estimate Shapley effects with the random permutation approach of Castro et al. (2009), where all the terms are obtained with ranking or nearest neighbours.
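
For instance, a call with a categorical input could look like the following sketch, where toy.fun and its inputs are purely illustrative (not part of the package) and the third column of X is declared categorical via id.cat:

    # Hypothetical toy model: two continuous inputs and one categorical input (3 levels)
    toy.fun <- function(X) sin(X[, 1]) + 2 * X[, 2] + c(0, 1, 3)[X[, 3]]
    n <- 1000
    X <- data.frame(X1 = runif(n), X2 = runif(n),
                    X3 = sample(1:3, n, replace = TRUE))
    # First-order indices by nearest neighbours, with column 3 declared categorical
    x <- sobolshap_knn(model = toy.fun, X = X, id.cat = 3, U = 1,
                       method = "knn", n.knn = 5)
    print(x)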

Usage

  sobolshap_knn(model = NULL, X, id.cat = NULL, U = NULL, method = "knn", n.knn = 2, 
                return.shap = FALSE, randperm = FALSE, n.perm = 1e4, 
                rescale = FALSE, n.limit = 2000, noise = FALSE, ...)
  ## S3 method for class 'sobolshap_knn'
tell(x, y = NULL, ...)
  ## S3 method for class 'sobolshap_knn'
extract(x, ...)
  ## S3 method for class 'sobolshap_knn'
print(x, ...)
  ## S3 method for class 'sobolshap_knn'
plot(x, ylim = c(0, 1), type.multout = "lines", ...)
  ## S3 method for class 'sobolshap_knn'
ggplot(data, mapping = aes(), ylim = c(0, 1), type.multout = "lines", ...,
       environment = parent.frame())

Arguments

model

a function, or a model with a predict method, defining the model to analyze.

X

a random sample of the inputs.

id.cat

a vector with the indices of the categorical inputs.

U

an integer equal to 0 (total Sobol indices) or 1 (first-order Sobol indices), a list of integer vectors, each giving the indices of a subset of inputs whose sensitivity indices must be computed, a matrix of 0s and 1s where each row encodes such a subset, or NULL (all possible subsets). See the illustration after this argument list and the examples.

method

the algorithm to be used for estimation, either "rank" or "knn", see details.

n.knn

the number of nearest neighbours used for estimation if method="knn".

return.shap

a logical indicating whether Shapley effects must be estimated; can only be TRUE if U=NULL.

randperm

a logical indicating if random permutations are used to estimate Shapley effects, only if U=NULL and return.shap=TRUE.

n.perm

the number of random permutations used for estimation if randperm=TRUE.

rescale

a logical indicating if continuous inputs must be rescaled before distance computations. If TRUE, continuous inputs are first whitened with the ZCA-cor whitening procedure (cf. whiten() function in package whitening). If the inputs are independent, this first step will have a very limited impact. Then, the resulting whitened inputs are individually modified via a copula transform such that each input has the same scale.

n.limit

the sample size limit above which approximate nearest neighbour search is activated, only used if method="knn".

noise

a logical which is TRUE if the model or the output sample is noisy, see details.

x

a list of class "sobolshap_knn" storing the state of the sensitivity study (parameters, data, estimates).

data

a list of class "sobolshap_knn" storing the state of the sensitivity study (parameters, data, estimates).

y

a vector of model responses.

ylim

y-coordinate plotting limits.

type.multout

the plotting method in the case of multiple outputs, either "points" or "lines", see examples.

mapping

Default list of aesthetic mappings to use for plot. If not specified, must be supplied in each layer added to the plot.

environment

[Deprecated] Used prior to tidy evaluation.

...

any other arguments for model which are passed unchanged each time it is called.
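
As an illustration of the different forms accepted by U, the following sketch reuses the g-function from the examples (the chosen subsets are arbitrary and for illustration only):

    n <- 1000
    X <- data.frame(matrix(runif(8 * n), nrow = n))
    # U = 1 (all first-order indices) and U = 0 (all total indices)
    x1 <- sobolshap_knn(model = sobol.fun, X = X, U = 1, method = "rank")
    x0 <- sobolshap_knn(model = sobol.fun, X = X, U = 0, method = "knn", n.knn = 5)
    # U as a list of index vectors: the subsets {X1, X2} and {X3}
    xg <- sobolshap_knn(model = sobol.fun, X = X, U = list(c(1, 2), 3),
                        method = "knn", n.knn = 5)
    # U as a 0/1 matrix: each row selects a subset of inputs (here the same two subsets)
    Umat <- rbind(c(1, 1, 0, 0, 0, 0, 0, 0),
                  c(0, 0, 1, 0, 0, 0, 0, 0))
    xm <- sobolshap_knn(model = sobol.fun, X = X, U = Umat,
                        method = "knn", n.knn = 5)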

Details

For method="rank", the estimator is defined in Gamboa et al. (2020) following Chatterjee (2019). For first-order indices it is based on an input ranking (same algorithm as in sobolrank) while for higher orders, it uses an approximate heuristic solution of the traveling salesman problem applied to the input sample distances (cf. TSP() function in package TSP). For method="knn", ranking and TSP are replaced by a nearest neighbour search as proposed in Broto et al. (2020) and in Azadkia & Chatterjee (2020) for a similar coefficient. The algorithm is the same as in shapleySubsetMc but with an optimized implementation. In particular, the distance used for subsets with mixed inputs (continuous and categorical) are the same but here the additional one-hot encoding of categorical variables makes it possible to work only with Euclidean distances. Furthermore, a fast approximate nearest neighbour search is also available, which is strongly recommended for large sample sizes. The main difference with shapleySubsetMc is that here we use the entire N sample to compute all indices, while in shapleySubsetMc the user can specify a total cost Ntot which performs a specific allocation of sample sizes to the estimation of each index. In addition, the weights option is not available here yet. If the outputs are noisy, the argument noise can be used: it only has an impact on the estimation of one specific sensitivity index, namely Var(E(Y|X1,\ldots,Xp))/Var(Y). If there is no noise this index is equal to 1, while in the presence of noise it must be estimated.

When randperm=TRUE, Shapley effects are no longer estimated by computing all the possible subsets of variables, but only the subsets obtained with random permutations, as proposed in Castro et al. (2009). This is useful for problems with a large number of inputs, since the number of subsets increases exponentially with dimension.
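
For instance, a call with random permutations could look like the sketch below (the number of permutations is arbitrary and should be adapted to the available budget):

    n <- 3000
    X <- data.frame(matrix(runif(8 * n), nrow = n))
    x <- sobolshap_knn(model = sobol.fun, X = X, U = NULL, method = "knn", n.knn = 5,
                       return.shap = TRUE, randperm = TRUE, n.perm = 1000)
    print(x)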

The extract method is useful if, in a first step, the Shapley effects have been computed, so that sensitivity indices for all possible subsets are available. The resulting sobolshap_knn object can then be post-processed with extract to retrieve first-order and total Sobol indices very easily.

Value

sobolshap_knn returns a list of class "sobolshap_knn", containing all the input arguments detailed before, plus the following components:

call

the matched call.

X

a data.frame containing the design of experiments.

y

a vector of model responses.

U

the subsets of inputs for which sensitivity indices have been computed.

S

the estimations of the Sobol sensitivity indices (see details).

Shap

the estimations of Shapley effects, if return.shap was set to TRUE.

order

0 (total indices), 1 (first-order indices) or NULL. Used for plotting defaults.

Author(s)

Sebastien Da Veiga

References

Azadkia M., Chatterjee S. (2021), A simple measure of conditional dependence, Annals of Statistics, 49(6):3070-3102.

Broto B., Bachoc F., Depecker M. (2020), Variance reduction for estimation of Shapley effects and adaptation to unknown input distribution, SIAM/ASA Journal on Uncertainty Quantification, 8:693-716.

Castro J., Gomez D., Tejada J. (2009), Polynomial calculation of the Shapley value based on sampling, Computers & Operations Research, 36(5):1726-1730.

Chatterjee S. (2021), A new coefficient of correlation, Journal of the American Statistical Association, 116:2009-2022.

Gamboa F., Gremaud P., Klein T., Lagnoux A. (2022), Global Sensitivity Analysis: a novel generation of mighty estimators based on rank statistics, Bernoulli, 28:2345-2374.

See Also

sobolrank, shapleysobol_knn, shapleySubsetMc

Examples

  
    # Test case: the non-monotonic Sobol g-function
    # Example with a call to a numerical model
    # First compute first-order indices with ranking
    n <- 1000
    X <- data.frame(matrix(runif(8 * n), nrow = n))
    x <- sobolshap_knn(model = sobol.fun, X = X, U = 1, method = "rank")
    print(x)
    library(ggplot2)
    ggplot(x)
    # We can use the output sample generated for this estimation to compute total indices 
    # without additional calls to the model
    x2 <- sobolshap_knn(model = NULL, X = X, U = 0, method = "knn", n.knn = 5)
    tell(x2,x$y)
    ggplot(x2)
    
    # Test case: the Ishigami function
    # Example with given data and the use of approximate nearest neighbour search
    library(RANN)
    n <- 5000
    X <- data.frame(matrix(-pi+2*pi*runif(3 * n), nrow = n))
    Y <- ishigami.fun(X)
    x <- sobolshap_knn(model = NULL, X = X, U = NULL, method = "knn", n.knn = 5, 
                       return.shap = TRUE, n.limit = 2000)
    tell(x,Y)
    library(ggplot2)
    ggplot(x)
    # We can also extract first-order and total Sobol indices
    x1 <- extract(x)
    print(x1)
    
    # Test case: Linear model (3 Gaussian inputs including 2 dependent) with scaling
    # See Iooss and Prieur (2019)
    library(mvtnorm) # Multivariate Gaussian variables
    library(whitening) # For scaling
    modlin <- function(X) apply(X,1,sum)
    d <- 3
    n <- 10000
    mu <- rep(0,d)
    sig <- c(1,1,2)
    ro <- 0.9
    Cormat <- matrix(c(1,0,0,0,1,ro,0,ro,1),d,d)
    Covmat <- ( sig %*% t(sig) ) * Cormat
    Xall <- function(n) mvtnorm::rmvnorm(n,mu,Covmat)
    X <- Xall(n)
    x <- sobolshap_knn(model = modlin, X = X, U = NULL, method = "knn", n.knn = 5, 
                       return.shap = TRUE, rescale = TRUE, n.limit = 2000)
    print(x)
    
    # Test case: functional toy model 'Arctangent temporal function'
    n <- 3000
    X <- data.frame(matrix(runif(2*n,-7,7), nrow = n))
    Y <- atantemp.fun(X)
    x <- sobolshap_knn(model = NULL, X = X, U = NULL, method = "knn", n.knn = 5, 
                       return.shap = TRUE, n.limit = 2000)
    tell(x,Y)
    library(ggplot2)
    library(reshape2)
    ggplot(x, type.multout="lines")
  

[Package sensitivity version 1.30.0]