R: Compute the Value of K for a Prediction Interval for a Normal...

predIntNormK {EnvStats}

R Documentation

Compute the Value of `K` for a Prediction Interval for a Normal Distribution

Description

Compute the value of K (the multiplier of estimated standard deviation) used to construct a prediction interval for the next k observations or next set of k means based on data from a normal distribution. The function predIntNormK is called by predIntNorm.

Usage

  predIntNormK(n, df = n - 1, n.mean = 1, k = 1,
    method = "Bonferroni", pi.type = "two-sided",
    conf.level = 0.95)

Arguments

`n`	a positive integer greater than 2 indicating the sample size upon which the prediction interval is based.
`df`	the degrees of freedom associated with the prediction interval. The default is `df=n-1`.
`n.mean`	positive integer specifying the sample size associated with the `k` future averages. The default value is `n.mean=1` (i.e., individual observations). Note that all future averages must be based on the same sample size.
`k`	positive integer specifying the number of future observations or averages the prediction interval should contain with confidence level `conf.level`. The default value is `k=1`.
`method`	character string specifying the method to use if the number of future observations (`k`) is greater than 1. The possible values are `method="Bonferroni"` (approximate method based on the Bonferroni inequality; the default), and `method="exact"` (exact method due to Dunnett, 1955). See the DETAILS section for more information. This argument is ignored if `k=1`.
`pi.type`	character string indicating what kind of prediction interval to compute. The possible values are `pi.type="two-sided"` (the default), `pi.type="lower"`, and `pi.type="upper"`.
`conf.level`	a scalar between 0 and 1 indicating the confidence level of the prediction interval. The default value is `conf.level=0.95`.

Details

A prediction interval for some population is an interval on the real line constructed so that it will contain k future observations or averages from that population with some specified probability (1-\alpha)100\%, where 0 < \alpha < 1 and k is some pre-specified positive integer. The quantity (1-\alpha)100\% is called the confidence coefficient or confidence level associated with the prediction interval.

Let \underline{x} = x_1, x_2, \ldots, x_n denote a vector of n observations from a normal distribution with parameters mean=\mu and sd=\sigma. Also, let m denote the sample size associated with the k future averages (i.e., n.mean=m). When m=1, each average is really just a single observation, so in the rest of this help file the term “averages” will replace the phrase “observations or averages”.

For a normal distribution, the form of a two-sided (1-\alpha)100\% prediction interval is:

[\bar{x} - Ks, \bar{x} + Ks] \;\;\;\;\;\; (1)

where \bar{x} denotes the sample mean:

\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \;\;\;\;\;\; (2)

s denotes the sample standard deviation:

s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 \;\;\;\;\;\; (3)

and K denotes a constant that depends on the sample size n, the confidence level, the number of future averages k, and the sample size associated with the future averages, m. Do not confuse the constant K (uppercase K) with the number of future averages k (lowercase k). The symbol K is used here to be consistent with the notation used for tolerance intervals (see tolIntNorm).

Similarly, the form of a one-sided lower prediction interval is:

[\bar{x} - Ks, \infty] \;\;\;\;\;\; (4)

and the form of a one-sided upper prediction interval is:

[-\infty, \bar{x} + Ks] \;\;\;\;\;\; (5)

but K differs for one-sided versus two-sided prediction intervals. The derivation of the constant K is explained below. The function predIntNormK computes the value of K and is called by predIntNorm.
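
As a minimal sketch of Equations (1), (4), and (5), the limits can be assembled from the sample mean, the sample standard deviation, and a supplied value of K. The helper name pred_limits and the simulated data are illustrative assumptions and are not part of EnvStats; the value of K is the one reported in the Examples section for n = 20, a two-sided interval, and a 95% confidence level.

```r
# Hypothetical helper (not an EnvStats function): build prediction limits
# from a sample x and a multiplier K, following Equations (1), (4), and (5).
pred_limits <- function(x, K, pi.type = c("two-sided", "lower", "upper")) {
  pi.type <- match.arg(pi.type)
  xbar <- mean(x)  # Equation (2)
  s <- sd(x)       # square root of Equation (3)
  switch(pi.type,
    "two-sided" = c(LPL = xbar - K * s, UPL = xbar + K * s),
    "lower"     = c(LPL = xbar - K * s, UPL = Inf),
    "upper"     = c(LPL = -Inf, UPL = xbar + K * s))
}

set.seed(47)                          # simulated data, for illustration only
x <- rnorm(20, mean = 10, sd = 2)
lims <- pred_limits(x, K = 2.144711)  # K for n = 20, two-sided, 95%
lims
```

Note that the same helper serves all three interval types only because K is passed in; the appropriate value of K differs between one-sided and two-sided intervals.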

The Derivation of K for One Future Observation or Average (k = 1)
Let X denote a random variable from a normal distribution with parameters mean=\mu and sd=\sigma, and let x_p denote the p'th quantile of X.

A true two-sided (1-\alpha)100\% prediction interval for the next k=1 observation of X is given by:

[x_{\alpha/2}, x_{1-\alpha/2}] = [\mu - z_{1-\alpha/2}\sigma, \mu + z_{1-\alpha/2}\sigma] \;\;\;\;\;\; (6)

where z_p denotes the p'th quantile of a standard normal distribution.

More generally, a true two-sided (1-\alpha)100\% prediction interval for the next k=1 average based on a sample of size m is given by:

[\mu - z_{1-\alpha/2}\frac{\sigma}{\sqrt{m}}, \mu + z_{1-\alpha/2}\frac{\sigma}{\sqrt{m}}] \;\;\;\;\;\; (7)

Because the values of \mu and \sigma are unknown, they must be estimated, and a prediction interval then constructed based on the estimated values of \mu and \sigma.

For a two-sided prediction interval (pi.type="two-sided"), the constant K for a (1-\alpha)100\% prediction interval for the next k=1 average based on a sample size of m is computed as:

K = t_{n-1, 1-\alpha/2} \sqrt{\frac{1}{m} + \frac{1}{n}} \;\;\;\;\;\; (8)

where t_{\nu, p} denotes the p'th quantile of the Student's t-distribution with \nu degrees of freedom. For a one-sided prediction interval (pi.type="lower" or pi.type="upper"), the constant K is computed as:

K = t_{n-1, 1-\alpha} \sqrt{\frac{1}{m} + \frac{1}{n}} \;\;\;\;\;\; (9)
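
Equations (8) and (9) can be evaluated directly with the base R quantile function qt. The following is a sketch under the assumption that df = n - 1 by default; the helper name K_single and its two.sided argument are hypothetical and do not mirror the EnvStats source.

```r
# Hypothetical helper (not the EnvStats source): K for k = 1 future average,
# following Equation (8) (two-sided) or Equation (9) (one-sided).
K_single <- function(n, m = 1, conf.level = 0.95, two.sided = TRUE,
                     df = n - 1) {
  alpha <- 1 - conf.level
  p <- if (two.sided) 1 - alpha / 2 else 1 - alpha
  qt(p, df) * sqrt(1 / m + 1 / n)
}

K_single(n = 20)                     # two-sided, next single observation
K_single(n = 20, two.sided = FALSE)  # one-sided
K_single(n = 20, m = 4)              # next average of 4 observations
```

With n = 20 and the defaults, the first call should agree with the value 2.144711 reported in the Examples section. The one-sided multiplier is smaller than the two-sided one because the whole of \alpha is placed in a single tail.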

The formulas for these prediction intervals are derived as follows. Let \bar{y} denote the future average based on m observations. Then the quantity \bar{y} - \bar{x} has a normal distribution with expectation and variance given by:

E(\bar{y} - \bar{x}) = 0 \;\;\;\;\;\; (10)

Var(\bar{y} - \bar{x}) = Var(\bar{y}) + Var(\bar{x}) = \frac{\sigma^2}{m} + \frac{\sigma^2}{n} = \sigma^2(\frac{1}{m} + \frac{1}{n}) \;\;\;\;\;\; (11)

so the quantity

t = \frac{\bar{y} - \bar{x}}{s\sqrt{\frac{1}{m} + \frac{1}{n}}} \;\;\;\;\;\; (12)

has a Student's t-distribution with n-1 degrees of freedom.
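
The derivation can be checked by simulation: if the quantity in Equation (12) follows a t-distribution with n-1 degrees of freedom, then an interval built with K from Equation (8) should contain the future average in about (1-\alpha)100\% of repeated experiments. The sketch below uses arbitrary values for n, m, \mu, and \sigma that are assumptions chosen for illustration only.

```r
set.seed(1)
n <- 20; m <- 2; conf.level <- 0.95
mu <- 5; sigma <- 3   # arbitrary "true" values used only for the simulation

# Equation (8)
K <- qt(1 - (1 - conf.level) / 2, n - 1) * sqrt(1 / m + 1 / n)

covered <- replicate(20000, {
  x <- rnorm(n, mu, sigma)           # background sample
  ybar <- mean(rnorm(m, mu, sigma))  # future average of m observations
  abs(ybar - mean(x)) <= K * sd(x)   # is ybar inside the interval (1)?
})
coverage <- mean(covered)
coverage   # should be close to conf.level, up to Monte Carlo error
```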

The Derivation of K for More than One Future Observation or Average (k > 1)
When k > 1, the function predIntNormK allows for two ways to compute K: an exact method due to Dunnett (1955) (method="exact"), and an approximate (conservative) method based on the Bonferroni inequality (method="Bonferroni"; see Miller, 1981a, pp.8, 67-70; Gibbons et al., 2009, p.4). Each of these methods is explained below.

Exact Method Due to Dunnett (1955) (method="exact")
Dunnett (1955) derived the value of K in the context of the multiple comparisons problem of comparing several treatment means to one control mean. The value of K is computed as:

K = c \sqrt{\frac{1}{m} + \frac{1}{n}} \;\;\;\;\;\; (13)

where c is a constant that depends on the sample size n, the number of future observations (averages) k, the sample size associated with the k future averages m, and the confidence level (1-\alpha)100\%.

When pi.type="lower" or pi.type="upper", the value of c is the number that satisfies the following equation (Gupta and Sobel, 1957; Hahn, 1970a):

1 - \alpha = \int_{0}^{\infty} F_1(cs, k, \rho) h(s\sqrt{n-1}, n-1) \sqrt{n-1} ds \;\;\;\;\;\; (14)

where

F_1(x, k, \rho) = \int_{-\infty}^{\infty} [\Phi(\frac{x + \rho^{1/2}y}{\sqrt{1 - \rho}})]^k \phi(y) dy \;\;\;\;\;\; (15)

\rho = 1 / (\frac{n}{m} + 1) \;\;\;\;\;\; (16)

h(x, \nu) = \frac{x^{\nu-1}e^{-x^2/2}}{2^{(\nu/2) - 1} \Gamma(\frac{\nu}{2})} \;\;\;\;\;\; (17)

and \Phi() and \phi() denote the cumulative distribution function and probability density function, respectively, of the standard normal distribution. Note that the function h(x, \nu) is the probability density function of a chi random variable with \nu degrees of freedom.

When pi.type="two-sided", the value of c is the number that satisfies the following equation:

1 - \alpha = \int_{0}^{\infty} F_2(cs, k, \rho) h(s\sqrt{n-1}, n-1) \sqrt{n-1} ds \;\;\;\;\;\; (18)

where

F_2(x, k, \rho) = \int_{-\infty}^{\infty} [\Phi(\frac{x + \rho^{1/2}y}{\sqrt{1 - \rho}}) - \Phi(\frac{-x + \rho^{1/2}y}{\sqrt{1 - \rho}})]^k \phi(y) dy \;\;\;\;\;\; (19)
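
The equations for the exact method can be solved numerically with base R. The sketch below is one possible approach and is not the algorithm used inside EnvStats: it evaluates the inner integral (with limits from minus infinity to infinity) using integrate, evaluates the outer integral over a range of s outside of which the chi density is negligible (an assumption of this sketch), and finds c with uniroot. The helper name K_exact, its two.sided argument, and all tolerances are hypothetical choices.

```r
# Hypothetical helper (not the EnvStats source): solve Equation (14) or (18)
# for c by nested numerical integration, then apply Equation (13).
K_exact <- function(n, m = 1, k = 1, conf.level = 0.95, two.sided = FALSE) {
  nu  <- n - 1
  rho <- 1 / (n / m + 1)                               # Equation (16)

  # Density of a chi random variable, Equation (17), via the log scale
  h <- function(x, nu) {
    exp((nu - 1) * log(x) - x^2 / 2 - (nu / 2 - 1) * log(2) - lgamma(nu / 2))
  }

  # Inner integral: Equation (15) (one-sided) or Equation (19) (two-sided)
  Fk <- function(x) {
    g <- function(y) {
      hi <- pnorm((x + sqrt(rho) * y) / sqrt(1 - rho))
      lo <- if (two.sided) pnorm((-x + sqrt(rho) * y) / sqrt(1 - rho)) else 0
      (hi - lo)^k * dnorm(y)
    }
    integrate(g, -Inf, Inf, rel.tol = 1e-9)$value
  }

  # Outer integral over s, truncated where the chi density is negligible
  s.lo <- sqrt(qchisq(1e-10, nu) / nu)
  s.hi <- sqrt(qchisq(1e-10, nu, lower.tail = FALSE) / nu)
  prob <- function(cval) {
    integrand <- function(s) {
      vapply(s, function(si) Fk(cval * si), numeric(1)) *
        h(s * sqrt(nu), nu) * sqrt(nu)
    }
    integrate(integrand, s.lo, s.hi, rel.tol = 1e-7)$value
  }

  # Find c such that the coverage probability equals conf.level
  cval <- uniroot(function(z) prob(z) - conf.level, interval = c(0, 10),
                  extendInt = "upX", tol = 1e-8)$root
  cval * sqrt(1 / m + 1 / n)                           # Equation (13)
}

# One-sided limit for the next k = 3 averages of order m = 2, n = 20, 99%
K_exact(n = 20, m = 2, k = 3, conf.level = 0.99)
```

When k = 1 the one-sided equation reduces to a statement about a single Student's t quantile, so the result should match t_{n-1, 1-\alpha} \sqrt{1/m + 1/n}, which offers a quick sanity check. For the call shown, the Examples section reports 2.251084 from predIntNormK with method="exact"; small differences can arise from the numerical tolerances chosen here.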

Approximate Method Based on the Bonferroni Inequality (method="Bonferroni")
As shown above, when k=1, the value of K is given by Equation (8) or Equation (9) for two-sided or one-sided prediction intervals, respectively. When k > 1, a conservative way to construct a (1-\alpha^*)100\% prediction interval for the next k observations or averages is to use a Bonferroni correction (Miller, 1981a, p.8) and set \alpha = \alpha^*/k in Equation (8) or (9) (Chew, 1968). This value of K will be conservative in that the computed prediction intervals will be wider than the exact prediction intervals. Hahn (1969, 1970a) compared the exact values of K with those based on the Bonferroni inequality for the case of m=1 and found the approximation to be quite satisfactory except when n is small, k is large, and \alpha is large. For example, Gibbons (1987a) notes that for a 99% prediction interval (i.e., \alpha = 0.01) for the next k observations, if n > 4, the bias of K is never greater than 1% no matter what the value of k.
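
The Bonferroni adjustment amounts to dividing \alpha by k before taking the t quantile. The sketch below assumes df = n - 1; the helper name K_bonferroni and its two.sided argument are hypothetical and are not taken from the EnvStats source.

```r
# Hypothetical helper (not the EnvStats source): Bonferroni-based K, obtained
# by replacing alpha with alpha / k in Equation (8) or (9).
K_bonferroni <- function(n, m = 1, k = 1, conf.level = 0.95,
                         two.sided = TRUE) {
  alpha <- (1 - conf.level) / k
  p <- if (two.sided) 1 - alpha / 2 else 1 - alpha
  qt(p, n - 1) * sqrt(1 / m + 1 / n)
}

# The two one-sided cases that appear in the Examples section
K_bonferroni(n = 20, m = 2, k = 3, conf.level = 0.99, two.sided = FALSE)
K_bonferroni(n = 12, k = 4, two.sided = FALSE)

# K grows with k, so the interval widens as more future values are covered
sapply(1:5, function(k) K_bonferroni(n = 20, k = k))
```

For the two one-sided cases, the Examples section reports 2.258026 and 2.698976 from the corresponding predIntNormK calls, which this sketch should reproduce.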

Value

A numeric scalar equal to K, the multiplier of estimated standard deviation that is used to construct the prediction interval.

Note

Prediction and tolerance intervals have long been applied to quality control and life testing problems (Hahn, 1970b,c; Hahn and Nelson, 1973). In the context of environmental statistics, prediction intervals are useful for analyzing data from groundwater detection monitoring programs at hazardous and solid waste facilities (e.g., Gibbons et al., 2009; Millard and Neerchal, 2001; USEPA, 2009).

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers. Lewis Publishers, Boca Raton.

Dunnett, C.W. (1955). A Multiple Comparisons Procedure for Comparing Several Treatments with a Control. Journal of the American Statistical Association 50, 1096-1121.

Dunnett, C.W. (1964). New Tables for Multiple Comparisons with a Control. Biometrics 20, 482-491.

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

Hahn, G.J. (1969). Factors for Calculating Two-Sided Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 64(327), 878-898.

Hahn, G.J. (1970a). Additional Factors for Calculating Prediction Intervals for Samples from a Normal Distribution. Journal of the American Statistical Association 65(332), 1668-1676.

Hahn, G.J. (1970b). Statistical Intervals for a Normal Population, Part I: Tables, Examples and Applications. Journal of Quality Technology 2(3), 115-125.

Hahn, G.J. (1970c). Statistical Intervals for a Normal Population, Part II: Formulas, Assumptions, Some Derivations. Journal of Quality Technology 2(4), 195-206.

Hahn, G.J., and W.Q. Meeker. (1991). Statistical Intervals: A Guide for Practitioners. John Wiley and Sons, New York.

Hahn, G., and W. Nelson. (1973). A Survey of Prediction Intervals and Their Applications. Journal of Quality Technology 5, 178-188.

Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York.

Helsel, D.R., and R.M. Hirsch. (2002). Statistical Methods in Water Resources. Techniques of Water Resources Investigations, Book 4, chapter A3. U.S. Geological Survey. (available on-line at: https://pubs.usgs.gov/tm/04/a03/tm4a3.pdf).

Millard, S.P., and N.K. Neerchal. (2001). Environmental Statistics with S-PLUS. CRC Press, Boca Raton, Florida.

Miller, R.G. (1981a). Simultaneous Statistical Inference. McGraw-Hill, New York.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Examples

  # Compute the value of K for a two-sided 95% prediction interval
  # for the next observation given a sample size of n=20.

  predIntNormK(n = 20)
  #[1] 2.144711

  #--------------------------------------------------------------------

  # Compute the value of K for a one-sided upper 99% prediction limit
  # for the next 3 averages of order 2 (i.e., each of the 3 future
  # averages is based on a sample size of 2 future observations) given a
  # sample size of n=20.

  predIntNormK(n = 20, n.mean = 2, k = 3, pi.type = "upper",
    conf.level = 0.99)
  #[1] 2.258026

  #----------

  # Compare the result above that is based on the Bonferroni method
  # with the exact method.

  predIntNormK(n = 20, n.mean = 2, k = 3, method = "exact",
    pi.type = "upper", conf.level = 0.99)
  #[1] 2.251084

  #--------------------------------------------------------------------

  # Example 18-1 of USEPA (2009, p.18-9) shows how to construct a 95%
  # prediction interval for 4 future observations assuming a
  # normal distribution based on arsenic concentrations (ppb) in
  # groundwater at a solid waste landfill.  There were 4 years of
  # quarterly monitoring, and years 1-3 are considered background.
  # So the sample size for the prediction limit is n = 12,
  # and the number of future samples is k = 4.

  predIntNormK(n = 12, k = 4, pi.type = "upper")
  #[1] 2.698976

[Package EnvStats version 2.8.1 Index]
