gofGroupTest {EnvStats}

R Documentation

Goodness-of-Fit Test for a Specified Probability Distribution for Groups

Description

Perform a goodness-of-fit test to determine whether data in a set of groups appear to all come from the same probability distribution (with possibly different parameters for each group).

Usage

gofGroupTest(object, ...)

## S3 method for class 'formula'
gofGroupTest(object, data = NULL, subset, 
  na.action = na.pass, ...)

## Default S3 method:
gofGroupTest(object, group, test = "sw", 
  distribution = "norm", est.arg.list = NULL, n.classes = NULL, 
  cut.points = NULL, param.list = NULL, 
  estimate.params = ifelse(is.null(param.list), TRUE, FALSE), 
  n.param.est = NULL, correct = NULL, digits = .Options$digits, 
  exact = NULL, ws.method = "normal scores", 
  data.name = NULL, group.name = NULL, parent.of.data = NULL, 
  subset.expression = NULL, ...)

## S3 method for class 'data.frame'
gofGroupTest(object, ...)

## S3 method for class 'matrix'
gofGroupTest(object, ...)

## S3 method for class 'list'
gofGroupTest(object, ...)

Arguments

`object`	an object containing data for 2 or more groups to be compared to the hypothesized distribution specified by `distribution`. In the default method, the argument `object` must be a numeric vector. When `object` is a data frame, all columns must be numeric. When `object` is a matrix, it must be a numeric matrix. When `object` is a list, all components must be numeric vectors. In the formula method, a symbolic specification of the form `y ~ g` can be given, indicating the observations in the vector `y` are to be grouped according to the levels of the factor `g`. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`data`	when `object` is a formula, `data` specifies an optional data frame, list or environment (or object coercible by `as.data.frame` to a data frame) containing the variables in the model. If not found in `data`, the variables are taken from `environment(formula)`, typically the environment from which `gofGroupTest` is called.
`subset`	when `object` is a formula, `subset` specifies an optional vector specifying a subset of observations to be used.
`na.action`	when `object` is a formula, `na.action` specifies a function which indicates what should happen when the data contain `NA`s. The default is `na.pass`.
`group`	when `object` is a numeric vector, `group` is a factor or character vector indicating which group each observation belongs to. When `object` is a matrix or data frame this argument is ignored and the columns define the groups. When `object` is a list this argument is ignored and the components define the groups. When `object` is a formula, this argument is ignored and the right-hand side of the formula specifies the grouping variable.
`test`	character string defining which goodness-of-fit test to perform on each group. Possible values are: `"sw"` (Shapiro-Wilk; the default), `"sf"` (Shapiro-Francia), `"ppcc"` (Probability Plot Correlation Coefficient), `"skew"` (Zero-skew), `"chisq"` (Chi-squared), `"ks"` (Kolmogorov-Smirnov), and `"ws"` (Wilk-Shapiro test for Uniform [0, 1] distribution).
`distribution`	a character string denoting the distribution abbreviation. See the help file for `Distribution.df` for a list of distributions and their abbreviations. The default value is `distribution="norm"` (Normal distribution). When `test="sw"`, `test="sf"`, or `test="ppcc"`, any continuous distribution is allowed (e.g., `"norm"` (normal), `"lnorm"` (lognormal), `"gamma"` (gamma), etc.), as well as mixed distributions involving the normal distribution (i.e., `"zmnorm"` (zero-modified normal), `"zmlnorm"` (zero-modified lognormal (delta)), and `"zmlnorm.alt"` (zero-modified lognormal with alternative parameterization)). When `test="skew"`, only the values `"norm"` (normal), `"lnorm"` (lognormal), `"lnorm.alt"` (lognormal with alternative parameterization), `"zmnorm"` (zero-modified normal), `"zmlnorm"` (zero-modified lognormal (delta)), and `"zmlnorm.alt"` (zero-modified lognormal with alternative parameterization) are allowed. When `test="ks"`, any continuous distribution is allowed. When `test="chisq"`, any distribution is allowed. When `test="ws"`, this argument is ignored.
`est.arg.list`	a list of arguments to be passed to the function estimating the distribution parameters for each group of observations. For example, if `test="sw"` and `distribution="gamma"`, setting `est.arg.list=list(method="bcmle")` indicates using the bias-corrected maximum-likelihood estimators of shape and scale (see the help file for `egamma`). See the help file Estimating Distribution Parameters for a list of estimating functions. The default value is `est.arg.list=NULL` so that all default values for the estimating function are used. This argument is ignored if `estimate.params=FALSE`. When `test="sw"`, `test="sf"`, `test="ppcc"`, or `test="skew"`, and you are testing for some form of normality (i.e., Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta)), the estimated parameters are provided in the output merely for information, and the choice of the method of estimation has no effect on the goodness-of-fit test statistics or p-values. When `test="ks"` or `test="chisq"`, and `estimate.params=TRUE`, the estimated parameters are used to specify the null hypothesis of which distribution the data are assumed to come from. When `test="ws"`, this argument is ignored.
`n.classes`	for the case when `test="chisq"`, the number of cells into which the observations within each group are to be allocated. If the argument `cut.points` is supplied, then `n.classes` is set to `length(cut.points)-1`. The default value is `ceiling(2 * (length(x)^(2/5)))`, where `x` denotes the observations within a group, and is recommended by Moore (1986).
`cut.points`	for the case when `test="chisq"`, a vector of cutpoints that defines the cells for each group of observations. The element `x[i]` is allocated to cell `j` if `cut.points[j] < x[i] <= cut.points[j+1]`. If `x[i]` is less than or equal to the first cutpoint or greater than the last cutpoint, then `x[i]` is treated as missing. If the hypothesized distribution is discrete, `cut.points` must be supplied. The default value is `cut.points=NULL`, in which case the cutpoints are determined by `n.classes` equi-probable intervals.
`param.list`	for the case when `test="ks"` or `test="chisq"`, a list with values for the parameters of the specified distribution. See the help file for `Distribution.df` for the names and possible values of the parameters associated with each distribution. The default value is `NULL`, which forces estimation of the distribution parameters. This argument is ignored if `estimate.params=TRUE`.
`estimate.params`	for the case when `test="ks"` or `test="chisq"`, a logical scalar indicating whether to perform the goodness-of-fit test based on estimating the distribution parameters (`estimate.params=TRUE`) or using the user-supplied distribution parameters specified by `param.list` (`estimate.params=FALSE`). The default value of `estimate.params` is `TRUE` if `param.list=NULL`, otherwise it is `FALSE`.
`n.param.est`	for the case when `test="ks"` or `test="chisq"`, an integer indicating the number of parameters estimated from the data. If `estimate.params=TRUE`, the default value is the number of parameters associated with the distribution specified by `distribution` (e.g., 2 for a normal distribution). If `estimate.params=FALSE`, the default value is `n.param.est=0`.
`correct`	for the case when `test="chisq"`, a logical scalar indicating whether to use the continuity correction. The default value is `correct=FALSE` unless `n.classes=2`.
`digits`	a scalar indicating how many significant digits to print out for the parameters associated with the hypothesized distribution. The default value is `.Options$digits`.
`exact`	for the case when `test="ks"`, `exact=NULL` by default, but can be set to a logical scalar indicating whether an exact p-value should be computed. See the help file for `ks.test` for more information.
`ws.method`	character string indicating which method to use when performing the Wilk-Shapiro test for a Uniform [0,1] distribution on the p-values from the goodness-of-fit tests on each group. Possible values are `ws.method="normal scores"` (the default) or `ws.method="chi-square scores"`. See the subsection Wilk-Shapiro goodness-of-fit test for Uniform [0, 1] distribution under the DETAILS section of the help file for `gofTest` for more information. NOTE: In the case where you are testing whether each group comes from a Uniform [0,1] distribution (i.e., when you set `test="ws"`), the argument `ws.method` determines which score types are used for each individual test of the groups as well.
`data.name`	character string indicating the name of the data used for the goodness-of-fit tests. The default value is `data.name=deparse(substitute(object))`.
`group.name`	character string indicating the name of the data used to create the groups. The default value is `group.name=deparse(substitute(group))`.
`parent.of.data`	character string indicating the source of the data used for the goodness-of-fit tests.
`subset.expression`	character string indicating the expression used to subset the data.
`...`	additional arguments affecting the goodness-of-fit test.
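
The cell-allocation rule described for `n.classes` and `cut.points` can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the sample `x` is hypothetical, the hypothesized distribution is assumed to be normal with parameters estimated by `mean` and `sd`, and `cut()` with `right = TRUE` stands in for the allocation rule.

```r
# Hypothetical sample of 20 observations (not taken from the package)
set.seed(47)
x <- rnorm(20, mean = 10, sd = 2)

# Default number of cells as stated for n.classes
n.classes <- ceiling(2 * (length(x)^(2/5)))

# Equi-probable cut points under a normal distribution with estimated
# parameters (assumption: quantiles at probabilities 0, 1/n.classes, ..., 1,
# so the outermost cut points are -Inf and Inf)
cut.points <- qnorm(seq(0, 1, length.out = n.classes + 1),
                    mean = mean(x), sd = sd(x))

# Allocate x[i] to cell j when cut.points[j] < x[i] <= cut.points[j + 1]
cells <- cut(x, breaks = cut.points, right = TRUE)
observed <- table(cells)

# With finite user-supplied cut points, a value at or below the first cut
# point or above the last one is not allocated to any cell (NA)
cells.user <- cut(c(1, 5, 9), breaks = c(2, 6, 8), right = TRUE)
```

With 20 observations the default formula gives 7 cells. In `cells.user`, only the value 5 falls inside a cell; 1 and 9 lie outside the supplied cut points and are treated as missing.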

Details

The function gofGroupTest performs a goodness-of-fit test for each group of data by calling the function gofTest. Using the p-values from these goodness-of-fit tests, it then calls the function gofTest with the argument test="ws" to test whether the p-values appear to come from a Uniform [0,1] distribution.
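
As a rough illustration of the second step, the sketch below combines the four individual p-values printed in the Examples section using normal scores. The formula G = sum(qnorm(p)) / sqrt(k) with a lower-tail normal p-value is an assumption inferred from the "Normal Scores" label in the printed output; it agrees with the printed statistic and group p-value to the digits shown, but it is not taken from the package source.

```r
# Individual p-values copied from the printed output in the Examples section
p.values <- c(Well.1 = 0.03510747, Well.2 = 0.02385344,
              Well.3 = 0.01120775, Well.4 = 0.10681461)
k <- length(p.values)

# Assumed normal-scores statistic: under the null hypothesis each p-value is
# Uniform [0, 1], so qnorm(p) is standard normal and the scaled sum is too
G <- sum(qnorm(p.values)) / sqrt(k)

# Small p-values push G downward, so the group p-value is the lower tail
p.group <- pnorm(G)

round(G, 2)   # about -3.66, in line with z (G) in the printed output
```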

Value

a list of class "gofGroup" containing the results of the group goodness-of-fit test. Objects of class "gofGroup" have special printing and plotting methods. See the help file for gofGroup.object for details.

Note

The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context of testing whether several independent samples all come from normal distributions, with possibly different means and variances. The function gofGroupTest extends this idea to allow you to test whether several independent samples come from the same distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.

Examples of simultaneously assessing whether several groups come from the same distribution are given in USEPA (2009) and Gibbons et al. (2009).

In practice, almost any goodness-of-fit test will fail to reject the null hypothesis if the number of observations is relatively small. Conversely, almost any goodness-of-fit test will reject the null hypothesis if the number of observations is very large, since “real” data are never distributed according to any theoretical distribution (Conover, 1980, p.367). For most cases, however, the distribution of “real” data is close enough to some theoretical distribution that fairly accurate results may be provided by assuming that particular theoretical distribution. One way to assess the goodness of the fit is to use goodness-of-fit tests. Another way is to look at quantile-quantile (Q-Q) plots (see qqPlot).
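
The effect of sample size can be seen in a small simulation. The sketch below is an assumption-laden illustration rather than part of the package: it uses `shapiro.test` from base R instead of `gofTest`, draws hypothetical data from a t distribution with 10 degrees of freedom (close to, but not exactly, normal), and compares rejection rates at the 5% level for a small and a large sample size.

```r
set.seed(1)

# Proportion of simulated samples of size n for which the Shapiro-Wilk test
# rejects normality at the 5% level
reject.rate <- function(n, n.sim = 200) {
  mean(replicate(n.sim, shapiro.test(rt(n, df = 10))$p.value < 0.05))
}

rate.small <- reject.rate(10)     # small samples: departure rarely detected
rate.large <- reject.rate(5000)   # 5000 is the largest n shapiro.test accepts

c(small = rate.small, large = rate.large)
```

The same mild departure from normality tends to go undetected in the small samples and to be flagged in most of the large ones, which is why a Q-Q plot is a useful companion to the formal test.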

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. p.17-17.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.

Examples

  # Example 10-4 of USEPA (2009, page 10-20) gives an example of 
  # simultaneously testing the assumption of normality for nickel 
  # concentrations (ppb) in groundwater collected at 4 monitoring 
  # wells over 5 months.  The data for this example are stored in 
  # EPA.09.Ex.10.1.nickel.df.

  EPA.09.Ex.10.1.nickel.df
  #   Month   Well Nickel.ppb
  #1      1 Well.1       58.8
  #2      3 Well.1        1.0
  #3      6 Well.1      262.0
  #4      8 Well.1       56.0
  #5     10 Well.1        8.7
  #6      1 Well.2       19.0
  #7      3 Well.2       81.5
  #8      6 Well.2      331.0
  #9      8 Well.2       14.0
  #10    10 Well.2       64.4
  #11     1 Well.3       39.0
  #12     3 Well.3      151.0
  #13     6 Well.3       27.0
  #14     8 Well.3       21.4
  #15    10 Well.3      578.0
  #16     1 Well.4        3.1
  #17     3 Well.4      942.0
  #18     6 Well.4       85.6
  #19     8 Well.4       10.0
  #20    10 Well.4      637.0


  # Test for a normal distribution at each well:
  #--------------------------------------------

  gofGroup.list <- gofGroupTest(Nickel.ppb ~ Well, 
    data = EPA.09.Ex.10.1.nickel.df)

  gofGroup.list

  #Results of Group Goodness-of-Fit Test
  #-------------------------------------
  #
  #Test Method:                     Wilk-Shapiro GOF (Normal Scores)
  #
  #Hypothesized Distribution:       Normal
  #
  #Data:                            Nickel.ppb
  #
  #Grouping Variable:               Well
  #
  #Data Source:                     EPA.09.Ex.10.1.nickel.df
  #
  #Number of Groups:                4
  #
  #Sample Sizes:                    Well.1 = 5
  #                                 Well.2 = 5
  #                                 Well.3 = 5
  #                                 Well.4 = 5
  #
  #Test Statistic:                  z (G) = -3.658696
  #
  #P-values for
  #Individual Tests:                Well.1 = 0.03510747
  #                                 Well.2 = 0.02385344
  #                                 Well.3 = 0.01120775
  #                                 Well.4 = 0.10681461
  #
  #P-value for
  #Group Test:                      0.0001267509
  #
  #Alternative Hypothesis:          At least one group
  #                                 does not come from a
  #                                 Normal Distribution.

  dev.new()
  plot(gofGroup.list)

  #----------

  # Test for a lognormal distribution at each well:
  #-----------------------------------------------

  # The argument name is spelled out rather than relying on partial matching
  gofGroupTest(Nickel.ppb ~ Well, data = EPA.09.Ex.10.1.nickel.df, 
    distribution = "lnorm")

  #Results of Group Goodness-of-Fit Test
  #-------------------------------------
  #
  #Test Method:                     Wilk-Shapiro GOF (Normal Scores)
  #
  #Hypothesized Distribution:       Lognormal
  #
  #Data:                            Nickel.ppb
  #
  #Grouping Variable:               Well
  #
  #Data Source:                     EPA.09.Ex.10.1.nickel.df
  #
  #Number of Groups:                4
  #
  #Sample Sizes:                    Well.1 = 5
  #                                 Well.2 = 5
  #                                 Well.3 = 5
  #                                 Well.4 = 5
  #
  #Test Statistic:                  z (G) = 0.2401720
  #
  #P-values for
  #Individual Tests:                Well.1 = 0.6898164
  #                                 Well.2 = 0.6700394
  #                                 Well.3 = 0.3208299
  #                                 Well.4 = 0.5041375
  #
  #P-value for
  #Group Test:                      0.5949015
  #
  #Alternative Hypothesis:          At least one group
  #                                 does not come from a
  #                                 Lognormal Distribution.

  #----------
  # Clean up
  rm(gofGroup.list)
  graphics.off()

[Package EnvStats version 2.8.1 Index]