R: High Probability Lower Bounds (HPLB) for the Total Variation...

HPLB {HPLB}

R Documentation

High Probability Lower Bounds (HPLB) for the Total Variation Distance (TV) Based on Finite Samples

Description

Implementations of different HPLBs for TV as described in (Michel et al., 2020).

Usage

HPLB(
  t,
  rho,
  s = 0.5,
  estimator.type = "adapt",
  alpha = 0.05,
  tv.seq = seq(from = 0, to = 1, by = 1/length(t)),
  custom.bounding.seq = NULL,
  direction = rep("left", length(s)),
  cutoff = 0.5,
  verbose.plot = FALSE,
  seed = 0,
  ...
)

Arguments

`t`	a numeric vector value corresponding to a natural ordering of the observations. For a two-sample test 0-1 numeric values values should be provided.
`rho`	a numeric vector value providing an ordering. This could be a binary classifier, a regressor, a witness function from a MMD kernel or anything else that would witness a distributional difference.
`s`	a numeric vector value giving split points on t.
`estimator.type`	a character value indicating which estimator to use. One option out of: `adapt`:adaptive binary classification estimator (asymptotic bounding function) `bayes`:binary classification estimator `bayes_finite_sample`:binary classification finite sample estimator `adapt_empirical`:adaptive binary classification estimator (simulation-based bounding function) `adapt_custom`:adaptive binary classificatrion estimator (user-defined bounding function) `adapt_dwit`:adaptive binary classificatrion estimator (for distributional witnesses estimation)
`alpha`	a numeric value giving the overall type-I error control level.
`tv.seq`	a sequence of values between 0 and 1 used as the grid search for the total variation distance in case of tv-search.
`custom.bounding.seq`	a list of bounding functions respecting the order of tv.seq used in case of estimator.type "custom-tv-search".
`direction`	a character vector value made of "left" or "right" giving which distribution witness count to estimate (t<=s or t>s?).
`cutoff`	a numeric value. This is the cutoff used if bayes estimators are used. The theory suggests to use 1/2 but this can be changed.
`verbose.plot`	a boolean value for additional plots.
`seed`	an integer value. The seed for reproducibility.
`...`	additional parameters for the function `empiricalBF`.

Value

a list containing the relevant lower bounds estimates. For the total variation distance the relevant entry is tvhat.

Author(s)

Loris Michel, Jeffrey Naef

References

L. Michel, J. Naef and N. Meinshausen (2020). High-Probability Lower Bounds for the Total Variation Distance

Examples

## libs
library(HPLB)
library(ranger)
library(distrEx)

## reproducibility
set.seed(0)

## Example 1: TV lower bound based on two samples (bayes estimator), Gaussian mean-shift example

n <- 100
means <- rep(c(0,2), each = n / 2)
x <- stats::rnorm(n, mean = means)
t <- rep(c(0,1), each = n / 2)

bayesRate <- function(x) {
  return(stats::dnorm(x, mean = 2) /
    (stats::dnorm(x, mean = 2) + stats::dnorm(x, mean = 0)))
}

# estimated HPLB
tvhat <- HPLB(t = t, rho = bayesRate(x), estimator.type = "bayes")
# true TV
TotalVarDist(e1 = Norm(2,1), e2 = Norm(0,1))

## Example 2: optimal mixture detection (adapt estimator), Gaussian mean-shift example

n <- 100
mean.shift <- 2
t.train <- runif(n, 0 ,1)
x.train <- ifelse(t.train>0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))
rf <- ranger::ranger(t~x, data.frame(t=t.train,x=x.train))

n <- 100
t.test <- runif(n, 0 ,1)
x.test <- ifelse(t.test>0.5, stats::rnorm(n, mean.shift), stats::rnorm(n))
rho <- predict(rf, data.frame(t=t.test,x=x.test))$predictions

## out-of-sample
tv.oos <- HPLB(t = t.test, rho = rho, s = seq(0.1,0.9,0.1), estimator.type = "adapt")


## total variation values
tv <- c()
for (s in seq(0.1,0.9,0.1)) {

 if (s<=0.5) {

   D.left <- Norm(0,1)
 } else {

   D.left <- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),
               mixCoeff = c(ifelse(s<=0.5, 1, 0.5/s), ifelse(s<=0.5, 0, (s-0.5)/s)))
 }
 if (s < 0.5) {

   D.right <- UnivarMixingDistribution(Dlist = list(Norm(0,1),Norm(mean.shift,1)),
               mixCoeff = c(ifelse(s<=0.5, (0.5-s)/(1-s), 0), ifelse(s<=0.5, (0.5/(1-s)), 1)))
 } else {

   D.right <- Norm(mean.shift,1)
 }
tv <- c(tv, TotalVarDist(e1 = D.left, e2 = D.right))
}

## plot
oldpar <- par(no.readonly =TRUE)
par(mfrow=c(2,1))
plot(t.test,x.test,pch=19,xlab="t",ylab="x")
plot(seq(0.1,0.9,0.1), tv.oos$tvhat,type="l",ylim=c(0,1),xlab="t", ylab="TV")
lines(seq(0.1,0.9,0.1), tv, col="red",type="l")
par(oldpar)

[Package HPLB version 1.0.0 Index]