predIntNormSimultaneousTestPower {EnvStats}R Documentation

Probability That at Least One Set of Future Observations Violates the Given Rule Based on a Simultaneous Prediction Interval for a Normal Distribution

Description

Compute the probability that at least one set of future observations violates the given rule based on a simultaneous prediction interval for the next rr future sampling occasions for a normal distribution. The three possible rules are: kk-of-mm, California, or Modified California.

Usage

  predIntNormSimultaneousTestPower(n, df = n - 1, n.mean = 1, k = 1, m = 2, r = 1, 
    rule = "k.of.m", delta.over.sigma = 0, pi.type = "upper", conf.level = 0.95, 
    r.shifted = r, K.tol = .Machine$double.eps^0.5, integrate.args.list = NULL)

Arguments

n

vector of positive integers greater than 2 indicating the sample size upon which the prediction interval is based.

df

vector of positive integers indicating the degrees of freedom associated with the sample size. The default value is df=n-1.

n.mean

positive integer specifying the sample size associated with the future averages. The default value is n.mean=1 (i.e., individual observations). Note that all future averages must be based on the same sample size.

k

for the kk-of-mm rule (rule="k.of.m"), vector of positive integers specifying the minimum number of observations (or averages) out of mm observations (or averages) (all obtained on one future sampling “occassion”) the prediction interval should contain with confidence level conf.level. The default value is k=1. This argument is ignored when the argument rule is not equal to "k.of.m".

m

vector of positive integers specifying the maximum number of future observations (or averages) on one future sampling “occasion”. The default value is m=2, except when rule="Modified.CA", in which case this argument is ignored and m is automatically set equal to 4.

r

vector of positive integers specifying the number of future sampling “occasions”. The default value is r=1.

rule

character string specifying which rule to use. The possible values are "k.of.m" (kk-of-mm rule; the default), "CA" (California rule), and "Modified.CA" (modified California rule). See the DETAILS section below for more information.

delta.over.sigma

numeric vector indicating the ratio Δ/σ\Delta/\sigma. The quantity Δ\Delta (delta) denotes the difference between the mean of the population that was sampled to construct the prediction interval, and the mean of the population that will be sampled to produce the future observations. The quantity σ\sigma (sigma) denotes the population standard deviation for both populations. See the DETAILS section below for more information. The default value is delta.over.sigma=0.

pi.type

character string indicating what kind of prediction interval to compute. The possible values are pi.type="upper" (the default), and pi.type="lower".

conf.level

vector of values between 0 and 1 indicating the confidence level of the prediction interval. The default value is conf.level=0.95.

r.shifted

vector of positive integers specifying the number of future sampling occasions for which the scaled mean is shifted by Δ/σ\Delta/\sigma. All values must be integeters between 1 and the corresponding element of r. The default value is r.shifted=r.

K.tol

numeric scalar indicating the tolerance to use in the nonlinear search algorithm to compute KK. The default value is K.tol=.Machine$double.eps^(1/2). For many applications, the value of KK needs to be known only to the second decimal place, in which case setting K.tol=1e-4 will speed up computation a bit.

integrate.args.list

a list of arguments to supply to the integrate function. The default value is integrate.args.list=NULL which means that the default values of integrate are used.

Details

What is a Simultaneous Prediction Interval?
A prediction interval for some population is an interval on the real line constructed so that it will contain kk future observations from that population with some specified probability (1α)100%(1-\alpha)100\%, where 0<α<10 < \alpha < 1 and kk is some pre-specified positive integer. The quantity (1α)100%(1-\alpha)100\% is called the confidence coefficient or confidence level associated with the prediction interval. The function predIntNorm computes a standard prediction interval based on a sample from a normal distribution.

The function predIntNormSimultaneous computes a simultaneous prediction interval that will contain a certain number of future observations with probability (1α)100%(1-\alpha)100\% for each of rr future sampling “occasions”, where rr is some pre-specified positive integer. The quantity rr may refer to rr distinct future sampling occasions in time, or it may for example refer to sampling at rr distinct locations on one future sampling occasion, assuming that the population standard deviation is the same at all of the rr distinct locations.

The function predIntNormSimultaneous computes a simultaneous prediction interval based on one of three possible rules:

Simultaneous prediction intervals can be extended to using averages (means) in place of single observations (USEPA, 2009, Chapter 19). That is, you can create a simultaneous prediction interval that will contain a specified number of averages (based on which rule you choose) on each of rr future sampling occassions, where each each average is based on ww individual observations. For the function predIntNormSimultaneous, the argument n.mean corresponds to ww.

The Form of a Prediction Interval
Let x=x1,x2,,xn\underline{x} = x_1, x_2, \ldots, x_n denote a vector of nn observations from a normal distribution with parameters mean=μ\mu and sd=σ\sigma. Also, let ww denote the sample size associated with the future averages (i.e., n.mean=ww). When w=1w=1, each average is really just a single observation, so in the rest of this help file the term “averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1α)100%(1-\alpha)100\% prediction interval is:

[xˉKs,xˉ+Ks]            (1)[\bar{x} - Ks, \bar{x} + Ks] \;\;\;\;\;\; (1)

where xˉ\bar{x} denotes the sample mean:

xˉ=1ni=1nxi            (2)\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (2)

ss denotes the sample standard deviation:

s2=1n1i=1n(xixˉ)2            (3)s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (3)

and KK denotes a constant that depends on the sample size nn, the confidence level, the number of future sampling occassions rr, and the sample size associated with the future averages, ww. Do not confuse the constant KK (uppercase K) with the number of future averages kk (lowercase k) in the kk-of-mm rule. The symbol KK is used here to be consistent with the notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

[xˉKs,]            (4)[\bar{x} - Ks, \infty] \;\;\;\;\;\; (4)

and the form of a one-sided upper prediction interval is:

[,xˉ+Ks]            (5)[-\infty, \bar{x} + Ks] \;\;\;\;\;\; (5)

Note: For simultaneous prediction intervals, only lower (pi.type="lower") and upper
(pi.type="upper") prediction intervals are available.

The derivation of the constant KK is explained in the help file for predIntNormSimultaneousK.

Computing Power
The "power" of the prediction interval is defined as the probability that at least one set of future observations violates the given rule based on a simultaneous prediction interval for the next rr future sampling occasions, where the population mean for the future observations is allowed to differ from the population mean for the observations used to construct the prediction interval.

The quantity Δ\Delta (upper case delta) denotes the difference between the mean of the population that was sampled to construct the prediction interval, and the mean of the population that will be sampled to produce the future observations. The quantity σ\sigma (sigma) denotes the population standard deviation of both of these populations. The argument delta.over.sigma corresponds to the quantity Δ/σ\Delta/\sigma.

Power Based on the k-of-m Rule (rule="k.of.m")
For the kk-of-mm rule (rule="k.of.m") with w=1w=1 (i.e., n.mean=1), at least kk of the next mm future observations will fall in the prediction interval with probability (1α)100%(1-\alpha)100\% on each of the rr future sampling occasions. If observations are being taken sequentially, for a particular sampling occasion, up to mm observations may be taken, but once kk of the observations fall within the prediction interval, sampling can stop. Note: When k=mk=m and r=1r=1, this kind of simultaneous prediction interval becomes the same as a standard prediction interval for the next kk observations (see predIntNorm).

Davis and McNichols (1987) show that for a one-sided upper prediction interval (pi.type="upper"), the probability pp that at least kk of the next mm future observations will be contained in the interval given in Equation (5) above, for each of rr future sampling occasions, is given by:

p=01T(nK;n1,n[Φ1(v)+Δσ])r[I(v;k,m+1k)]r1[vk1(1v)mkB(k,m+1k)]dv            (6)p = \int_0^1 T(\sqrt{n}K; n-1, \sqrt{n}[\Phi^{-1}(v) + \frac{\Delta}{\sigma}]) r[I(v; k, m+1-k)]^{r-1} [\frac{v^{k-1}(1-v)^{m-k}}{B(k, m+1-k)}] dv \;\;\;\;\;\; (6)

where T(x;ν,δ)T(x; \nu, \delta) denotes the cdf of the non-central Student's t-distribution with parameters df=ν\nu and ncp=δ\delta evaluated at xx; Φ(x)\Phi(x) denotes the cdf of the standard normal distribution evaluated at xx; I(x;ν,ω)I(x; \nu, \omega) denotes the cdf of the beta distribution with parameters shape1=ν\nu and shape2=ω\omega; and B(ν,ω)B(\nu, \omega) denotes the value of the beta function with parameters a=ν\nu and b=ω\omega.

The quantity Δ\Delta (upper case delta) denotes the difference between the mean of the population that was sampled to construct the prediction interval, and the mean of the population that will be sampled to produce the future observations. The quantity σ\sigma (sigma) denotes the population standard deviation of both of these populations. Usually you assume Δ=0\Delta=0 unless you are interested in computing the power of the rule to detect a change in means between the populations, as we are here.

If we are interested in using averages instead of single observations, with w1w \ge 1 (i.e., n.mean1\ge 1), the first term in the integral in Equation (6) that involves the cdf of the non-central Student's t-distribution becomes:

T(nK;n1,nw[Φ1(v)+wΔσ])            (7)T(\sqrt{n}K; n-1, \frac{\sqrt{n}}{\sqrt{w}}[\Phi^{-1}(v) + \frac{\sqrt{w}\Delta}{\sigma}]) \;\;\;\;\;\; (7)

For a given confidence level (1α)100%(1-\alpha)100\%, the power of the rule to detect a change in means is simply given by:

Power=1p            (8)Power = 1 - p \;\;\;\;\;\; (8)

where pp is defined in Equation (6) above using the value of KK that corresponds to Δ/σ=0\Delta/\sigma = 0. Thus, when the argument delta.over.sigma=0, the value of pp is 1α1-\alpha and the power is simply α100%\alpha 100\%. As delta.over.sigma increases above 0, the power increases.

When pi.type="lower", the same value of K is used as when pi.type="upper", but Equation (4) is used to construct the prediction interval. Thus, the power increases as delta.over.sigma decreases below 0.

Power Based on the California Rule (rule="CA")
For the California rule (rule="CA"), with probability (1α)100%(1-\alpha)100\%, for each of the rr future sampling occasions, either the first observation will fall in the prediction interval, or else all of the next m1m-1 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, m1m-1 more observations must be taken.

The derivation of the power is the same as for the kk-of-mm rule, except that Equation (6) becomes the following (Davis, 1998b):

p=01T(nK;n1,n[Φ1(v)+Δσ])r{v[1+vm2(1v)]}r1[1+vm2(m1mv)]dv            (9)p = \int_0^1 T(\sqrt{n}K; n-1, \sqrt{n}[\Phi^{-1}(v) + \frac{\Delta}{\sigma}]) r\{v[1+v^{m-2}(1-v)]\}^{r-1} [1+v^{m-2}(m-1-mv)] dv \;\;\;\;\;\; (9)


Power Based on the Modified California Rule (rule="Modified.CA")
For the Modified California rule (rule="Modified.CA"), with probability (1α)100%(1-\alpha)100\%, for each of the rr future sampling occasions, either the first observation will fall in the prediction interval, or else at least 2 out of the next 3 observations will fall in the prediction interval. That is, if the first observation falls in the prediction interval then sampling can stop. Otherwise, up to 3 more observations must be taken.

The derivation of the power is the same as for the kk-of-mm rule, except that Equation (6) becomes the following (Davis, 1998b):

p=01T(nK;n1,n[Φ1(v)+Δσ])r{v[1+v(3v[52v])]}r1{1+v[6v(158v)]}dv            (10)p = \int_0^1 T(\sqrt{n}K; n-1, \sqrt{n}[\Phi^{-1}(v) + \frac{\Delta}{\sigma}]) r\{v[1+v(3-v[5-2v])]\}^{r-1} \{1+v[6-v(15-8v)]\} dv \;\;\;\;\;\; (10)

Value

vector of values between 0 and 1 equal to the probability that the rule will be violated.

Note

See the help file for predIntNormSimultaneous.

In the course of designing a sampling program, an environmental scientist may wish to determine the relationship between sample size, significance level, power, and scaled difference if one of the objectives of the sampling program is to determine whether two distributions differ from each other. The functions predIntNormSimultaneousTestPower and
plotPredIntNormSimultaneousTestPowerCurve can be used to investigate these relationships for the case of normally-distributed observations.

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

See the help file for predIntNormSimultaneous.

See Also

predIntNormSimultaneous, predIntNormSimultaneousK,
plotPredIntNormSimultaneousTestPowerCurve, predIntNorm, predIntNormK,
predIntNormTestPower, Prediction Intervals, Normal.

Examples

  # For the k-of-m rule with n=4, k=1, m=3, and r=1, show how the power increases 
  # as delta.over.sigma increases. Assume a 95% upper prediction interval.

  predIntNormSimultaneousTestPower(n = 4, m = 3, delta.over.sigma = 0:2) 
  #[1] 0.0500000 0.2954156 0.7008558

  #----------

  # Look at how the power increases with sample size for an upper one-sided 
  # prediction interval using the k-of-m rule with k=1, m=3, r=20, 
  # delta.over.sigma=2, and a confidence level of 95%.

  predIntNormSimultaneousTestPower(n = c(4, 8), m = 3, r = 20, delta.over.sigma = 2) 
  #[1] 0.6075972 0.9240924

  #----------

  # Compare the power for the 1-of-3 rule with the power for the California and 
  # Modified California rules, based on a 95% upper prediction interval and 
  # delta.over.sigma=2.  Assume a sample size of n=8.  Note that in this case the 
  # power for the Modified California rule is greater than the power for the 
  # 1-of-3 rule and California rule.

  predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, delta.over.sigma = 2) 
  #[1] 0.788171 

  predIntNormSimultaneousTestPower(n = 8, m = 3, rule = "CA", delta.over.sigma = 2) 
  #[1] 0.7160434 

  predIntNormSimultaneousTestPower(n = 8, rule = "Modified.CA", delta.over.sigma = 2) 
  #[1] 0.8143687

  #----------

  # Show how the power for an upper 95% simultaneous prediction limit increases 
  # as the number of future sampling occasions r increases.  Here, we'll use the 
  # 1-of-3 rule with n=8 and delta.over.sigma=1.

  predIntNormSimultaneousTestPower(n = 8, k = 1, m = 3, r=c(1, 2, 5, 10), 
    delta.over.sigma = 1) 
  #[1] 0.3492512 0.4032111 0.4503603 0.4633773

  #==========

  # USEPA (2009) contains an example on page 19-23 that involves monitoring 
  # nw=100 compliance wells at a large facility with minimal natural spatial 
  # variation every 6 months for nc=20 separate chemicals.  
  # There are n=25 background measurements for each chemical to use to create
  # simultaneous prediction intervals.  We would like to determine which kind of
  # resampling plan based on normal distribution simultaneous prediction intervals to
  # use (1-of-m, 1-of-m based on means, or Modified California) in order to have
  # adequate power of detecting an increase in chemical concentration at any of the
  # 100 wells while at the same time maintaining a site-wide false positive rate
  # (SWFPR) of 10% per year over all 4,000 comparisons 
  # (100 wells x 20 chemicals x semi-annual sampling).

  # The function predIntNormSimultaneousTestPower includes the argument "r" 
  # that is the number of future sampling occasions (r=2 in this case because 
  # we are performing semi-annual sampling), so to compute the individual test 
  # Type I error level alpha.test (and thus the individual test confidence level), 
  # we only need to worry about the number of wells (100) and the number of 
  # constituents (20): alpha.test = 1-(1-alpha)^(1/(nw x nc)).  The individual 
  # confidence level is simply 1-alpha.test.  Plugging in 0.1 for alpha, 
  # 100 for nw, and 20 for nc yields an individual test confidence level of 
  # 1-alpha.test = 0.9999473.

  nc <- 20
  nw <- 100
  conf.level <- (1 - 0.1)^(1 / (nc * nw))
  conf.level
  #[1] 0.9999473

  # Now we can compute the power of any particular sampling strategy using
  # predIntNormSimultaneousTestPower.  For example, here is the power of 
  # detecting an increase of three standard deviations in concentration using 
  # the prediction interval based on the "1-of-2" resampling rule:

  predIntNormSimultaneousTestPower(n = 25, k = 1, m = 2, r = 2, rule = "k.of.m", 
    delta.over.sigma = 3,  pi.type = "upper", conf.level = conf.level)
  #[1] 0.3900202

  # The following commands will reproduce the table shown in Step 2 on page
  # 19-23 of USEPA (2009).  Because these commands can take more than a few 
  # seconds to execute, we have commented them out here.  To run this example, 
  # just remove the pound signs (#) that are in front of R commands.

  #rule.vec <- c(rep("k.of.m", 3), "Modified.CA",  rep("k.of.m", 3))

  #m.vec <- c(2, 3, 4, 4, 1, 2, 1)

  #n.mean.vec <- c(rep(1, 4), 2, 2, 3)

  #n.scenarios <- length(rule.vec)

  #K.vec <- numeric(n.scenarios)

  #Power.vec <- numeric(n.scenarios)

  #K.vec <- predIntNormSimultaneousK(n = 25, k = 1, m = m.vec,  n.mean = n.mean.vec, 
  #  r = 2, rule = rule.vec,  pi.type = "upper", conf.level = conf.level)

  #Power.vec <- predIntNormSimultaneousTestPower(n = 25, k = 1, m = m.vec, 
  #  n.mean = n.mean.vec, r = 2, rule = rule.vec,  delta.over.sigma = 3, 
  #  pi.type = "upper",  conf.level = conf.level)

  #Power.df <- data.frame(Rule = rule.vec, k = rep(1, n.scenarios),  m = m.vec, 
  #  N.Mean = n.mean.vec, K = round(K.vec, 2),  Power = round(Power.vec, 2),
  #  Total.Samples = m.vec * n.mean.vec)

  #Power.df

  #         Rule k m N.Mean    K Power Total.Samples
  #1      k.of.m 1 2      1 3.16  0.39             2
  #2      k.of.m 1 3      1 2.33  0.65             3
  #3      k.of.m 1 4      1 1.83  0.81             4
  #4 Modified.CA 1 4      1 2.57  0.71             4
  #5      k.of.m 1 1      2 3.62  0.41             2
  #6      k.of.m 1 2      2 2.33  0.85             4
  #7      k.of.m 1 1      3 2.99  0.71             3

  # The above table shows the K-multipliers for each prediction interval, along with
  # the power of detecting a change in concentration of three standard deviations at
  # any of the 100 wells during the course of a year, for each of the sampling
  # strategies considered.  The last three rows of the table correspond to sampling
  # strategies that involve using the mean of two or three observations.

  #==========

  # Clean up
  #---------
  rm(nc, nw, conf.level, rule.vec, m.vec, n.mean.vec, n.scenarios, K.vec, 
    Power.vec, Power.df)

[Package EnvStats version 2.8.1 Index]