ci_mincost {OptHoldoutSize}R Documentation

Confidence interval for minimum total cost, when estimated using parametric method

Description

Compute confidence interval for cost at optimal holdout size given either a standard error covariance matrix or a set of m estimates of parameters.

This can be done either asymptotically, using a method analogous to the Fisher information matrix, or empirically (using bootstrap resampling)

If sigma (covariance matrix) is specified and method='bootstrap', a confidence interval is generated assuming a Gaussian distribution of (N,k1,theta). To estimate a confidence interval assuming a non-Gaussian distribution, simulate values under the requisite distribution and use then as parameters N,k1, theta, with sigma set to NULL.

Usage

ci_mincost(
  N,
  k1,
  theta,
  alpha = 0.05,
  k2form = powerlaw,
  grad_mincost = NULL,
  sigma = NULL,
  n_boot = 1000,
  seed = NULL,
  mode = "empirical",
  ...
)

Arguments

N

Vector of estimates of total number of samples on which the predictive score will be used/fitted, or single estimate

k1

Vector of estimates of cost value in the absence of a predictive score, or single number

theta

Matrix of estimates of parameters for function k2form(n) governing expected cost to an individual sample given a predictive score fitted to n samples. Can be a matrix of dimension n x n_par, where n_par is the number of parameters of k2.

alpha

Construct 1-alpha confidence interval. Defaults to 0.05

k2form

Function governing expected cost to an individual sample given a predictive score fitted to n samples. Must take two arguments: n (number of samples) and theta (parameters). Defaults to a power-law form ⁠k2(n,c(a,b,c))=a n^(-b) + c⁠.

grad_mincost

Function giving partial derivatives of minimum cost, taking three arguments: N, k1, and theta. Only used for asymptotic confidence intervals. F NULL, estimated empirically

sigma

Standard error covariance matrix for (N,k1,theta), in that order. If NULL, will derive as sample covariance matrix of parameters. Must be of the correct size and positive definite.

n_boot

Number of bootstrap resamples for empirical estimate.

seed

Random seed for bootstrap resamples. Defaults to NULL.

mode

One of 'asymptotic' or 'empirical'. Defaults to 'empirical'

...

Passed to function optimize

Value

A vector of length two containing lower and upper limits of confidence interval.

Examples


## Set seed
set.seed(574635)

## We will assume that our observations of N, k1, and theta=(a,b,c) are
##  distributed with mean mu_par and variance sigma_par
mu_par=c(N=10000,k1=0.35,A=3,B=1.5,C=0.1)
sigma_par=cbind(
  c(100^2,       1,      0,       0,       0),
  c(    1,  0.07^2,      0,       0,       0),
  c(    0,       0,  0.5^2,    0.05,  -0.001),
  c(    0,       0,   0.05,   0.4^2,  -0.002),
  c(    0,       0, -0.001,  -0.002,  0.02^2)
)

# Firstly, we make 500 observations
par_obs=rmnorm(500,mean=mu_par,varcov=sigma_par)

# Minimum cost and asymptotic and empirical confidence intervals
mincost=optimal_holdout_size(N=mean(par_obs[,1]),k1=mean(par_obs[,2]),
  theta=colMeans(par_obs[,3:5]))$cost
ci_a=ci_mincost(N=par_obs[,1],k1=par_obs[,2],theta=par_obs[,3:5],alpha=0.05,
  seed=12345,mode="asymptotic")
ci_e=ci_mincost(N=par_obs[,1],k1=par_obs[,2],theta=par_obs[,3:5],alpha=0.05,
  seed=12345,mode="empirical")


# Assess cover at various m
m_values=c(20,30,50,100,150,200,300,500,750,1000,1500)
ntrial=5000
alpha_trial=0.1 # use 90% confidence intervals
mincost_true=optimal_holdout_size(N=mu_par[1],k1=mu_par[2],
  theta=mu_par[3:5])$cost

## The matrices indicating cover take are included in this package but take
##  around 30 minutes to generate. They are generated using the code below
##  (including random seeds).
data(ci_cover_cost_a_yn)
data(ci_cover_cost_e_yn)

if (!exists("ci_cover_cost_a_yn")) {
  ci_cover_cost_a_yn=matrix(NA,length(m_values),ntrial) # Entry [i,j] is 1
  #  if ith asymptotic CI for jth value of m covers true mincost
  ci_cover_cost_e_yn=matrix(NA,length(m_values),ntrial) # Entry [i,j] is 1
  #  if ith empirical CI for jth value of m covers true mincost

  for (i in 1:length(m_values)) {
    m=m_values[i]
    for (j in 1:ntrial) {
      # Set seed
      set.seed(j*ntrial + i + 12345)

      # Make m observations
      par_obs=rmnorm(m,mean=mu_par,varcov=sigma_par)
      ci_a=ci_mincost(N=par_obs[,1],k1=par_obs[,2],theta=par_obs[,3:5],
        alpha=alpha_trial,mode="asymptotic")
      ci_e=ci_mincost(N=par_obs[,1],k1=par_obs[,2],theta=par_obs[,3:5],
        alpha=alpha_trial,mode="empirical",n_boot=500)

      if (mincost_true>ci_a[1] & mincost_true<ci_a[2])
        ci_cover_cost_a_yn[i,j]=1 else ci_cover_cost_a_yn[i,j]=0
      if (mincost_true>ci_e[1] & mincost_true<ci_e[2])
        ci_cover_cost_e_yn[i,j]=1 else ci_cover_cost_e_yn[i,j]=0
    }
    print(paste0("Completed for m = ",m))
  }

save(ci_cover_cost_a_yn,file="data/ci_cover_cost_a_yn.RData")
save(ci_cover_cost_e_yn,file="data/ci_cover_cost_e_yn.RData")

}

# Cover at each m value and standard error
cover_a=rowMeans(ci_cover_cost_a_yn)
cover_e=rowMeans(ci_cover_cost_e_yn)
zse_a=2*sqrt(cover_a*(1-cover_a)/ntrial)
zse_e=2*sqrt(cover_e*(1-cover_e)/ntrial)


# Draw plot. Convergence to 1-alpha cover is evident. Cover is not far from
#   alpha even at small m.

plot(0,type="n",xlim=range(m_values),ylim=c(0.7,1),xlab=expression("m"),
  ylab="Cover")

# Asymptotic cover and 2*SE pointwise envelope
polygon(c(m_values,rev(m_values)),c(cover_a+zse_a,rev(cover_a-zse_a)),
  col=rgb(0,0,0,alpha=0.3),border=NA)
lines(m_values,cover_a,col="black")

# Empirical cover and 2*SE pointwiseenvelope
polygon(c(m_values,rev(m_values)),c(cover_e+zse_e,rev(cover_e-zse_e)),
  col=rgb(0,0,1,alpha=0.3),border=NA)
lines(m_values,cover_e,col="blue")

abline(h=1-alpha_trial,col="red")
legend("bottomright",c("Asym.","Emp.",expression(paste("1-",alpha))),lty=1,
  col=c("black","blue","red"))


[Package OptHoldoutSize version 0.1.0.0 Index]