R: Multiple Grubbs-Beck Test (MGBT) for Low Outliers

MGBT {MGBT}

R Documentation

Multiple Grubbs–Beck Test (MGBT) for Low Outliers

Description

Perform the Multiple Grubbs–Beck Test (MGBT; Cohn and others, 2013) for low outliers (LOTs, low-outlier threshold; potentially influential low floods, PILFs) that is implemented in the USGS-PeakFQ software (USGS, 2014; Veilleux and others, 2014) for implementation of Bulletin 17C (B17C) (England and others, 2018). The test internally transforms the data to logarithms (base-10) and thus is oriented for positively distributed data but accommodates zeros in the dataset.

The essence of the MGBT, given the order statistics x_{[1:n]} \le x_{[2:n]} \le \cdots \le x_{[(n-1):n]} \le x_{[n:n]}, is the statistic

GB_r = \omega_r = \frac{ x_{[r:n]} - \mathrm{mean}\{x_{[(r+1)\rightarrow n:n]}\} } {\sqrt{\mathrm{var}\{x_{[(r+1)\rightarrow n:n]}\}}}\mbox{,}

which is can be computed by MGBTcohn2011 that is a port a function of TAC's used in a testing script that is reproduced in the Examples of RSlo. Variations of this pseudo-standardization scheme are shown for BLlo and RSlo. Also, GB_r is the canonical form of the variable eta in TAC sources and peta=peta will be its associated probability.

Usage

MGBT(...) # A wrapper on MGBT17C()---This is THE function for end users.

     MGBT17c(x, alphaout=0.005, alphain=0, alphazeroin=0.10,
                n2=floor(length(x)/2), napv.zero=TRUE, offset=0, min.obs=0)
MGBT17c.verb(x, alphaout=0.005, alphain=0, alphazeroin=0.10,
                n2=floor(length(x)/2), napv.zero=TRUE, offset=0, min.obs=0)

MGBTcohn2016(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)
MGBTcohn2013(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)
      MGBTnb(x, alphaout=0.005, alphazeroin=0.10, n2=floor(length(x)/2),
                napv.zero=TRUE, offset=0)

MGBTcohn2011(x, r=NULL, n=length(x)) # only computes the GB_r, not a test

Arguments

`...`	Arguments to pass to the MGBT family of functions;
`x`	The data values and note that base-10 logarithms of these are computed internally except for the operation of the `MGBTcohn2011` function, which does not (see Examples for `RSlo`). Also protection from zero or negative values is made by the R function `pmax`, and these values are replaced with a “small” value of `1e-8` and tacitly TAC has assumed that p-values for these will be significantly small and truncated away;
`alphaout`	Literally the `\alpha_\mathrm{out}` of Bulletin 17C. This is the significance level of the “sweep out” portion of MGBT;
`alphain`	This is the significance level of the “sweep in” portion of MGBT but starts at one plus the order statistic identified by `alphaout`;
`alphazeroin`	Literally the `\alpha_\mathrm{in}` of Bulletin 17C. This is the significance level of the “sweep in” portion of MGBT;
`napv.zero`	A logical switch to reset a returned `NA` from `RthOrderPValueOrthoT` to zero. This is a unique extension by WHA based on large-scale batch testing of the USGS peak-values database (see Note). This being said, the fall-back to Monte Carlo integration if the numerical integration fails, seems to mostly make this argument superfluous;
`offset`	The offset, if not `NA`, is added from the threshold unless the threshold itself is already zero. In practical application, this offset, if set, would likely be a negative quantity. This argument is a unique extension by WHA;
`min.obs`	The minimum number of observations. This option is provided to streamline larger applications, but the underlying logic in `MGBT17C` is robust and on failures because of small sizes return a threshold of `0` anyway;
`n2`	The number of `n2`-smallest values to be evaluated in the MGBT;
`r`	The number of truncated observations, which can be though of the rth order statistic and below; and
`n`	The number of observations. It is not clear that TAC intended `n` to be not equal to the sample size but TAC chose to not keep the length of `x` as determined internally to the function but to have it also available as an argument. Functions `BLlo` and `RSlo` also were designed similarly.

Value

The MGBT results as an R list:

`index`	The sample size `n`, the value for `n2`, and the three indices of the “sweep out,” “sweep in,” and “sweep in from zero” processing (only for `MGBT17c` as this is an extension from TAC);
`omegas`	The `GB_r = \omega_r` statistics for which the p-values in `pvalues` are shown. These are mostly returned for aid in debugging and verification of the algorithms;
`x`	The `n2`-smallest values in increasing order (only for `MGBT17c` as this is an extension from TAC);
`pvalues`	The p-values of the `n2`-smallest values of the sample (not available for `MGBT17c` because of algorithm design for speed);
`klow`	The number of low outliers detected;
`LOThresh`	The low-outlier threshold for the `klow+1` index of the sample (and possibly adjusted by the `offset`) or simply zero; and
`message`	Possibly message in event of some internal difficulty.

The inclusion of x in the returned value is to add symmetry because the p-values are present. The inclusion of n and n2 might make percentage computations of inward and outward sweep indices useful in exploratory analyses. Finally, the inclusion of the sweep indices is important as it was through inspection of these that the problems in TAC sources were discovered.

Note

Porting from TAC sources—TAC used MGBT for flood-frequency computations by a call

  oMGBT <- MGBT(Q=o_in@qu[o_in@typeSystematic])

in file P3_089(R).txt, and note the named argument Q= but consider in the definition Q is not defined as a named argument. For the MGBT package, the Q has been converted to a more generic variable x. Development of TAC's B17C version through the P3_089(R).txt or similar sources will simply require Q= to be removed from the MGBT call.

The original MGBT algorithms in R by TAC will throw some errors and warnings that required testing elsewhere for completeness (non-NULL) in algorithms “up the chain.” These errors appear to matter materially in pratical application in large-scale batch processing of USGS Texas peak-values data by WHA and GRH. For package MGBT, the TAC computations have been modified to wrap a try() around the numerical integration within RthOrderPValueOrthoT of peta, and insert a NA when the integration fails if and only if a second integration (fall-back) attempt using Monte Carlo integration fails as well.

The following real-world data were discovered to trigger the error/warning messages. These example data crash TAC's MGBT and the data point of 25 cubic feet per second (cfs) is the culpret. If a failure is detected, Monte Carlo integration is attempted as a fall-back procedure using defaults of RthOrderPValueOrthoT, and if that integration succeeds, MGBT, which is not aware, simply receives the p-value. If Monte Carlo integration also fails, then for the implementation in package MGBT, the p-value is either a NA (napv.zero=FALSE) or set to zero if napv.zero=TRUE. Evidence suggests that numerical difficulties are encountered when small p-values are involved.

  # Peak streamflows for 08385600 (1952--2015, systematic record only)
  #https://nwis.waterdata.usgs.gov/nwis/peak?site_no=08385600&format=hn2
  Data  <- c(  8100, 3300,  680, 14800,  25.0, 7310, 2150, 1110, 5200, 900, 1150,
   1050,  880, 2100, 2280, 2620,   830,  4900,  970,  560,  790, 1900, 830,  255,
   2900, 2100,    0,  550, 1200,  1300,   246,  700,  870, 4350,  870, 435, 3000,
    880, 2650,  185,  620, 1650,   680, 22900, 3290,  584, 7290, 1690, 2220, 217,
   4110,  853,  275, 1780, 1330,  3170,  7070, 2660) # cubic feet per second (cfs)
  MGBT17c(     Data, napv.zero=TRUE)$LOThres    # [1] 185
  MGBT17c.verb(Data, napv.zero=TRUE)$LOThres    # [1] 185
  MGBTcohn2016(Data, napv.zero=TRUE)$LOThres    # [1] 185
  MGBTcohn2013(Data, napv.zero=TRUE)$LOThres    # [1] 185
  MGBTnb(      Data, napv.zero=TRUE)$LOThres    # [1] 185

Without having the fall-back Monte Carlo integration in RthOrderPValueOrthoT, if napv.zero= FALSE, then the low-outlier threshold is 25 cfs, but if napv.zero=TRUE, then the low-outlier threshold is 185 cfs, which is the value matching USGS-PeakFQ (v7.1) (FORTRAN code base). Hence, the recommendation that napv.zero=TRUE for the default, though such a setting will for this example will still result in 185 cfs because Monte Carlo integration can not be turned off for the implementation here.

Noting that USGS-PeakFQ (7.1) reports the p-value for the 25 cfs as 0.0002, a test of the MGBT implementation with the backup Monte Carlo integration in RthOrderPValueOrthoT shows a p-value of 1.748946e-04 for 25 cfs, which is congruent with the 0.0002 of USGS-PeakFQ (v7.1). Another Monte Carlo integration produced a p-value of 1.990057e-04 for 25 cfs, and thus another result congruent with USGS-PeakFQ (v7.1) is evident.

Using original TAC sources, here are the general errors that can be seen:

   Error in integrate(peta, lower = 1e-07, upper = 1 - 1e-07, n = n, r = r, :
   the integral is probably divergent In addition:
   In pt(q, df = df, ncp = ncp, lower.tail = TRUE) :
     full precision may not have been achieved in 'pnt[final]'

For both the internal implementations of RthOrderPValueOrthoT and peta error trapping is present to return a NA. It is not fully known whether the integral appears divergent when pt() (probability of the t-distribution) reaches an end point in apparent accuracy or not—although, this is suspected. For this package, a suppressWarnings() has been wrapped around the call to pt() in peta as well as in one other computation that at least in small samples can result in a square root of a negative number (see Note under peta).

There is another error to trap for this package. If all the data values are identical, a low-outlier threshold set at that value leaks back. This is the motivation for a test added by WHA using length(unique(x)) == 1 in the internals of MGBTcohn2016, MGBTcohn2013, and MGBTnb.

A known warning message might be seen at least in microsamples:

   MGBT(c( 1, 26300))       # throws    warnings, zero is the threshold
     # In EMS(n, r, qmin) : value out of range in 'lgamma'
   MGBT(c( 1, 26300, 2600)) # throws no warnings, zero is the threshold

The author wraps a suppressWarnings() on the line in EMS requiring lgamma(). This warning appears restricted to nearly a degenerate situation anyway and failure will result in peta and therein the situation results in a p-value of unity and hence no risk of identifying a threshold.

Regarding n=length(x) for MGBTcohn2011, it is not clear whether TAC intended n to be not equal to the sample size. TAC chose to not determine the length of x internally to the function but to have it available as an argument. Also BLlo and RSlo were designed similarly.

(1) Lingering Issue of Inquiry—TAC used a j1 index in MGBTcohn2016, MGBTcohn2013, and MGBTnb, and this index is used as part of alphaout. The j1 and is not involved in the returned content of the MGBT approach. This seems to imply a problem with the “sweep out” approach but TAC inadvertantly seems to make “sweep out” work in MGBTnb but therein creates a “sweep in” problem. The “sweep in” appears fully operational in MGBTcohn2016 and MGBTcohn2013. Within the source for MGBTcohn2016 and MGBTcohn2013, a fix can be made with just one line after the n2 values have been processed. Here is the WHA commentary within the sources, and it is important to note that the fix is not turned on because use of MGBT17c via the wrapper MGBT is the recommended interface:

   # ---***--------------***--- TAC CRITICAL BUG ---***--------------***---
   # j2 <- min(c(j1,j2)) # WHA tentative completion of the 17C alogrithm!?!
   # HOWEVER MAJOR WARNING. WHA is using a minimum and not a maximum!!!!!!!
   # See MGBT17C() below. In that if the line a few lines above that reads
   # if((pvalueW[i] < alpha1 )) { j1 <- i; j2 <- i }
   # is replaced with if((pvalueW[i] < alpha1 )) j1 <- i
   # then maximum and not the minimum becomes applicable.
   # ---***--------------***--- TAC CRITICAL BUG ---***--------------***---

(2) Lingering Issue of Inquiry—TAC used recursion in MGBTcohn2013 and MGBTnb for a condition j2 == n2. This recursion does not exist in MGBTcohn2016. TAC seems to have explored the idea of a modification to the n2 to be related to an idea of setting a limit of at least five retained observations below half the sample (default n2) when up to half the sample is truncated away. Also in the recursion, TAC resets the two alpha's with alphaout=0.01 from the default of alpha1=0.005 and alphazeroin=0.10. This reset means that had TAC hardwired these inside MGBTcohn2013, which partially defeats the purpose of having them as arguments in the first place.

(3) Lingering Issue of Inquiry—TAC used recursion in MGBTcohn2013 and MGBTnb for a condition j2 == n2 but the recursion calls MGBTcohn2013. Other comments about the recursion in Inquiry (2) about the alpha's are applicable here as well. More curious about MGBTnb is that the alphazeroin is restricted to a test on only the p-value for the smallest observation and is made outside the loop through the data. The test is if(pvalueW[1] < alphazeroin & j2 == 0) j2 <- 1, which is a logic flow that differs from that in MGBTcohn2013 and MGBTcohn2016.

On the Offset Argument—The MGBT approach identifies the threshold as the first order statistic (x_{[r+1:n]}) above that largest “outlying” order statistic x_{[r:n]} with the requisite small p-value (see the example in the Examples section herein). There is a practical application in which a nudge on the MGBT-returned low-outlier threshold could be useful. Consider that the optional offset argument is added to the threshold unless the threshold itself is already zero. The offset should be small, and as an example {-}0.001 would be a magnitude below the resolution of the USGS peak-values database.

Why? If algorithms other than USGS-PeakFQ are involved in frequency analyses, the concept of “in” or “out” of analyses, respectively, could be of the form xin <- x[x > threshold] and xlo <- x[x <= threshold]. Note the “less than or equal to” and its assocation with those data to be truncated away, which might be more consistent with the idea of truncation level of left-tail censored data in other datasets for which the MGBT approach might be used. For example, the x2xlo() function of the lmomco package by Asquith (2019) uses such logic in its implementation of conditional probability adjustment for the presence of low outliers. This logic naturally supports a default truncation for zeros.

Perhaps one reason for the difference in how a threshold is implemented is based on two primary considerations. First, if restricted to choosing a threshold to the sample order statistics, then the MGBT approach, by returning the first (earliest) order statistics that is not statistically significant, requires x[x < threshold] for the values to leave out. TAC clearly intended this form. Second, TAC seems to approach the zero problem with MGBT by replacing all zeros with 1e-8 as in

  log10(pmax(1e-8,x))

immediately before the logarithmic transformation. This might be a potential weak link in MGBT. It assumes that 1e-8 is small, which it certainly is for the problem of flood-frequency analysis using peaks in cubic feet per second. TAC has hardwired this value. Reasonable enough.

But such logic thus requires the MGBT to identify these pseudo-zeros (now 1e-8) in all circumstances as low outliers. Do the algorithms otherwise do this? This approach is attractive for B17C because one does not have to track a subsample of those values greater than zero because the low-outlier test will capture them. An implementation such as Asquith (2019) automatically removes zero without the need to use a low-outlier identification method; hence, Asquith's choice of x[x <= threshold] with threshold=0 by default for the values to leave out. The inclusion of offset permits cross compatibility for MGBT package purposes trancending Bulletin 17C.

Author(s)

W.H. Asquith

Source

LowOutliers_jfe(R).txt, LowOutliers_wha(R).txt, P3_089(R).txt—Named MGBT + MGBTnb

References

Asquith, W.H., 2019, lmomco—L-moments, trimmed L-moments, L-comoments, censored
L-moments, and many distributions: R package version 2.3.2 (September 20, 2018), accessed March 30, 2019, at https://cran.r-project.org/package=lmomco.

Cohn, T.A., 2013–2016, Personal communication of original R source code: U.S. Geological Survey, Reston, Va.

Cohn, T.A., England, J.F., Berenbrock, C.E., Mason, R.R., Stedinger, J.R., and Lamontagne, J.R., 2013, A generalized Grubbs-Beck test statistic for detecting multiple potentially influential low outliers in flood series: Water Resources Research, v. 49, no. 8, pp. 5047–5058.

England, J.F., Cohn, T.A., Faber, B.A., Stedinger, J.R., Thomas Jr., W.O., Veilleux, A.G., Kiang, J.E., and Mason, R.R., 2018, Guidelines for determining flood flow frequency Bulletin 17C: U.S. Geological Survey Techniques and Methods, book 4, chap. 5.B, 148 p., doi: 10.3133/tm4B5

U.S. Geological Survey (USGS), 2018, PeakFQ—Flood frequency analysis based on Bulletin 17B and recommendations of the Advisory Committee on Water Information (ACWI) Subcommittee on Hydrology (SOH) Hydrologic Frequency Analysis Work Group (HFAWG), version 7.2: Accessed November 29, 2018, at https://water.usgs.gov/software/PeakFQ/.

Veilleux, A.G., Cohn, T.A., Flynn, K.M., Mason, R.R., Jr., and Hummel, P.R., 2014, Estimating magnitude and frequency of floods using the PeakFQ 7.0 program: U.S. Geological Survey Fact Sheet 2013–3108, 2 p., doi: 10.3133/fs20133108.

Examples


# USGS 08066300 (1966--2016) # cubic feet per second (cfs)
#https://nwis.waterdata.usgs.gov/nwis/peak?site_no=08066300&format=hn2
Values <- c(3530, 284, 1810, 9660,  489,  292, 1000,  2640, 2910, 1900,  1120, 1020,
   632, 7160, 1750,  2730,  1630, 8210, 4270, 1730, 13200, 2550,  915, 11000, 2370,
  2230, 4650, 2750,  1860, 13700, 2290, 3390, 5160, 13200,  410, 1890,  4120, 3930,
  4290, 1890, 1480, 10300,  1190, 2320, 2480, 55.0,  7480,  351,  738,  2430, 6700)
MGBT(Values) # Results LOT=284 cfs leaving 55.0 cfs (p-value=0.0119) censored.
#$index
#      n      n2    ix_alphaout     ix_alphain   ix_alphazeroin
#     51      25              0              0                1
#$omegas
# [1] -3.781980 -2.268554 -2.393569 -2.341027 -2.309990 -2.237571
# [7] -2.028614 -1.928391 -1.720404 -1.673523 -1.727138 -1.671534
#[13] -1.661346 -1.391819 -1.293324 -1.246974 -1.276485 -1.272878
#[19] -1.280917 -1.310286 -1.372402 -1.434898 -1.226588 -1.237743
#[25] -1.276794
#$x
# [1]   55  284  292  351  410  489  632  738  915 1000 1020 1120 1190 1480 1630 1730
#[17] 1750 1810 1860 1890 1890 1900 2230 2290 2320
#$pvalues
# [1] 0.01192184 0.30337879 0.08198836 0.04903091 0.02949836 0.02700114 0.07802324
# [8] 0.11185553 0.31531749 0.34257170 0.21560086 0.25950150 0.24113157 0.72747052
#[15] 0.86190920 0.89914152 0.84072131 0.82381908 0.78750571 0.70840262 0.55379730
#[22] 0.40255392 0.79430336 0.75515103 0.66031442
#$LOThresh
#[1] 284

# The USGS-PeakFQ (v7.1) software reports:
#   EMA003I-PILFS (LOS) WERE DETECTED USING MULTIPLE GRUBBS-BECK TEST   1     284.0
#      THE FOLLOWING PEAKS (WITH CORRESPONDING P-VALUES) WERE CENSORED:
#            55.0    (0.0123)
# As a curiosity, see Examples under ASlo().#


# MGBTnb() has a sweep in problem.
SweepIn <- c(1, 1, 3200, 5270, 26300, 38400, 8710, 23200, 39300, 27800, 21000,
  21000, 21500, 57000, 53700, 5720, 10700, 4050, 4890, 10500, 26300, 16600, 20900,
  21400, 10800, 8910, 6360)  # sweep in and out both identify index 2.
MGBT17c(SweepIn, alphaout=0)$LOThres # LOT = 3200 # force no sweep outs
MGBTnb(SweepIn)$LOThres              # LOT = 3200 # because sweep out is the same!
MGBTnb(SweepIn, alphaout=0) # LOT = 1  # force no sweep outs, it fails.

[Package MGBT version 1.0.7 Index]