gofTest {EnvStats} | R Documentation |
Goodness-of-Fit Test
Description
Perform a goodness-of-fit test to determine whether a data set appears to come from a specified probability distribution or if two data sets appear to come from the same distribution.
Usage
gofTest(y, ...)
## S3 method for class 'formula'
gofTest(y, data = NULL, subset,
na.action = na.pass, ...)
## Default S3 method:
gofTest(y, x = NULL,
test = ifelse(is.null(x), "sw", "ks"),
distribution = "norm", est.arg.list = NULL,
alternative = "two.sided", n.classes = NULL,
cut.points = NULL, param.list = NULL,
estimate.params = ifelse(is.null(param.list), TRUE, FALSE),
n.param.est = NULL, correct = NULL, digits = .Options$digits,
exact = NULL, ws.method = "normal scores", warn = TRUE, keep.data = TRUE,
data.name = NULL, data.name.x = NULL, parent.of.data = NULL,
subset.expression = NULL, ...)
Arguments
y |
an object containing data for the goodness-of-fit test. In the default
method, the argument |
data |
specifies an optional data frame, list or environment (or object coercible
by |
subset |
specifies an optional vector specifying a subset of observations to be used. |
na.action |
specifies a function which indicates what should happen when the data contain |
x |
numeric vector of values for the first sample in the case of a two-sample
Kolmogorov-Smirnov goodness-of-fit test ( |
test |
character string defining which goodness-of-fit test to perform. Possible values are:
When the argument |
distribution |
a character string denoting the distribution abbreviation. See the help file for |
est.arg.list |
a list of arguments to be passed to the function estimating the distribution parameters.
For example, if |
alternative |
for the case when |
n.classes |
for the case when |
cut.points |
for the case when |
param.list |
for the case when |
estimate.params |
for the case when |
n.param.est |
for the case when |
correct |
for the case when |
digits |
for the case when |
exact |
for the case when |
ws.method |
for the case when |
warn |
logical scalar indicating whether to print a warning message when
observations with |
keep.data |
logical scalar indicating whether to return the data used for the goodness-of-fit test.
The default value is |
data.name |
character string indicating the name of the data used for argument |
data.name.x |
character string indicating the name of the data used for argument |
parent.of.data |
character string indicating the source of the data used for the goodness-of-fit test. |
subset.expression |
character string indicating the expression used to subset the data. |
... |
additional arguments affecting the goodness-of-fit test. |
Details
-
Shapiro-Wilk Goodness-of-Fit Test (test="sw").
The Shapiro-Wilk goodness-of-fit test (Shapiro and Wilk, 1965; Royston, 1992a) is one of the most commonly used goodness-of-fit tests for normality. You can use it to test the following hypothesized distributions: Normal, Lognormal, Three-Parameter Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta). You can also use it to test the null hypothesis of any continuous distribution that is available (see the help file for Distribution.df, and see the explanation below).
Shapiro-Wilk W-Statistic and P-Value for Testing Normality
Let X denote a random variable with cumulative distribution function (cdf) F. Suppose we want to test the null hypothesis that F is the cdf of a normal (Gaussian) distribution with some arbitrary mean \mu and standard deviation \sigma against the alternative hypothesis that F is the cdf of some other distribution. The table below shows the random variable for which F is the assumed cdf, given the value of the argument distribution.
Value of distribution | Distribution Name | Random Variable for which F is the cdf |
"norm" | Normal | X |
"lnorm" | Lognormal (Log-space) | log(X) |
"lnormAlt" | Lognormal (Untransformed) | log(X) |
"lnorm3" | Three-Parameter Lognormal | log(X-\gamma) |
"zmnorm" | Zero-Modified Normal | X | X > 0 |
"zmlnorm" | Zero-Modified Lognormal (Log-space) | log(X) | X > 0 |
"zmlnormAlt" | Zero-Modified Lognormal (Untransformed) | log(X) | X > 0 |
Note that for the three-parameter lognormal distribution, the symbol \gamma denotes the threshold parameter, and that in the last three rows the random variable is conditional on X > 0.
Let \underline{x} = (x_1, x_2, \ldots, x_n) denote the vector of n ordered observations assumed to come from a normal distribution.
The Shapiro-Wilk W-Statistic
Shapiro and Wilk (1965) introduced the following statistic to test the null hypothesis that F is the cdf of a normal distribution:
W = \frac{(\sum_{i=1}^n a_i x_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (1)
where the quantity a_i is the i'th element of the vector \underline{a} defined by:
\underline{a} = \frac{\underline{m}^T V^{-1}}{[\underline{m}^T V^{-1} V^{-1} \underline{m}]^{1/2}} \;\;\;\;\;\; (2)
where T denotes the transpose operator, \underline{m} is the vector of expected values, and V is the variance-covariance matrix of the order statistics of a random sample of size n from a standard normal distribution. That is, the values of \underline{a} are the expected values of the standard normal order statistics weighted by their variance-covariance matrix, and normalized so that
\underline{a}^T \underline{a} = 1 \;\;\;\;\;\; (3)
It can be shown that the coefficients \underline{a} are antisymmetric, that is,
a_i = -a_{n-i+1} \;\;\;\;\;\; (4)
and for odd n,
a_{(n+1)/2} = 0 \;\;\;\;\;\; (5)
Now because
\bar{a} = \frac{1}{n} \sum_{i=1}^n a_i = 0 \;\;\;\;\;\; (6)
and
\sum_{i=1}^n (a_i - \bar{a})^2 = \sum_{i=1}^n a_i^2 = \underline{a}^T \underline{a} = 1 \;\;\;\;\;\; (7)
the W-statistic in Equation (1) is the same as the square of the sample product-moment correlation between the vectors \underline{a} and \underline{x}:
W = r(\underline{a}, \underline{x})^2 \;\;\;\;\;\; (8)
where
r(\underline{x}, \underline{y}) = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{[\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2]^{1/2}} \;\;\;\;\;\; (9)
(see the R help file for cor).
The Shapiro-Wilk W-statistic is also simply the ratio of two estimators of variance, and can be rewritten as
W = \frac{\hat{\sigma}_{BLUE}^2}{\hat{\sigma}_{MVUE}^2} \;\;\;\;\;\; (10)
where the numerator is the square of the best linear unbiased estimate (BLUE) of the standard deviation, and the denominator is the minimum variance unbiased estimator (MVUE) of the variance:
\hat{\sigma}_{BLUE} = \frac{\sum_{i=1}^n a_i x_i}{\sqrt{n-1}} \;\;\;\;\;\; (11)
\hat{\sigma}_{MVUE}^2 = \frac{\sum_{i=1}^n (x_i - \bar{x})^2}{n-1} \;\;\;\;\;\; (12)
Small values of W indicate the null hypothesis is probably not true. Shapiro and Wilk (1965) computed the values of the coefficients \underline{a} and the percentage points for W (based on smoothing the empirical null distribution of W) for sample sizes up to 50. Computation of the W-statistic for larger sample sizes can be cumbersome, since computation of the coefficients \underline{a} requires storage of at least n + [n(n+1)/2] reals followed by n \times n matrix inversion (Royston, 1992a).
The Shapiro-Francia W'-Statistic
Shapiro and Francia (1972) introduced a modification of the W-test that depends only on the expected values of the order statistics (\underline{m}) and not on the variance-covariance matrix (V):
W' = \frac{(\sum_{i=1}^n b_i x_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (13)
where the quantity b_i is the i'th element of the vector \underline{b} defined by:
\underline{b} = \frac{\underline{m}}{[\underline{m}^T \underline{m}]^{1/2}} \;\;\;\;\;\; (14)
Several authors, including Ryan and Joiner (1973), Filliben (1975), and Weisberg and Bingham (1975), note that the W'-statistic is intuitively appealing because it is the squared Pearson correlation coefficient associated with a normal probability plot. That is, it is the squared correlation between the ordered sample values \underline{x} and the expected normal order statistics \underline{m}:
W' = r(\underline{b}, \underline{x})^2 = r(\underline{m}, \underline{x})^2 \;\;\;\;\;\; (15)
Shapiro and Francia (1972) present a table of empirical percentage points for W' based on a Monte Carlo simulation. It can be shown that the asymptotic null distributions of W and W' are identical, but convergence is very slow (Verrill and Johnson, 1988).
The Weisberg-Bingham Approximation to the W'-Statistic
Weisberg and Bingham (1975) introduced an approximation of the Shapiro-Francia W'-statistic that is easier to compute. They suggested using Blom scores (Blom, 1958, pp.68–75) to approximate the elements of \underline{m}:
\tilde{W}' = \frac{(\sum_{i=1}^n c_i x_i)^2}{\sum_{i=1}^n (x_i - \bar{x})^2} \;\;\;\;\;\; (16)
where the quantity c_i is the i'th element of the vector \underline{c} defined by:
\underline{c} = \frac{\underline{\tilde{m}}}{[\underline{\tilde{m}}^T \underline{\tilde{m}}]^{1/2}} \;\;\;\;\;\; (17)
and
\tilde{m}_i = \Phi^{-1}[\frac{i - (3/8)}{n + (1/4)}] \;\;\;\;\;\; (18)
and \Phi denotes the standard normal cdf. That is, the values of the elements of \underline{m} in Equation (14) are replaced with their estimates based on the usual plotting positions for a normal distribution.
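The Weisberg-Bingham statistic can be sketched in a few lines of base R. This is an illustration only, not the code used by gofTest; the function name weisberg_bingham is hypothetical. It computes Equation (16) directly and then checks that the result equals the squared correlation between the Blom scores and the ordered data.

```r
# Hypothetical helper: Weisberg-Bingham approximation to W' (Equations 16-18)
weisberg_bingham <- function(x) {
  x <- sort(x)                                      # ordered observations
  n <- length(x)
  m_tilde <- qnorm((seq_len(n) - 3/8) / (n + 1/4))  # Blom scores, Equation (18)
  c_vec <- m_tilde / sqrt(sum(m_tilde^2))           # normalized scores, Equation (17)
  sum(c_vec * x)^2 / sum((x - mean(x))^2)           # statistic, Equation (16)
}

set.seed(1)
x <- rnorm(25, mean = 10, sd = 2)
w_prime <- weisberg_bingham(x)

# Equivalent form: squared correlation from a normal probability plot
# that uses Blom plotting positions (ppoints with a = 3/8)
blom_scores <- qnorm(ppoints(length(x), a = 3/8))
r_squared <- cor(blom_scores, sort(x))^2
```

The two quantities agree because the Blom scores sum to zero and the vector in Equation (17) has unit length.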
Royston's Approximation to the Shapiro-Wilk W-Test
Royston (1992a) presents an approximation for the coefficients \underline{a} necessary to compute the Shapiro-Wilk W-statistic, and also a transformation of the W-statistic that has approximately a standard normal distribution under the null hypothesis.
Noting that, up to a constant, the components of \underline{b} in Equation (14) and \underline{c} in Equation (17) differ from those of \underline{a} in Equation (2) mainly in the first and last two components, Royston (1992a) used the approximation \underline{c} as the basis for approximating \underline{a} using polynomial (quintic) regression analysis. For 4 \le n \le 1000, the approximation gave the following equations for the last two (and hence first two) components of \underline{a}:
\tilde{a}_n = c_n + 0.221157 y - 0.147981 y^2 - 2.071190 y^3 + 4.434685 y^4 - 2.706056 y^5 \;\;\;\;\;\; (19)
\tilde{a}_{n-1} = c_{n-1} + 0.042981 y - 0.293762 y^2 - 1.752461 y^3 + 5.682633 y^4 - 3.582633 y^5 \;\;\;\;\;\; (20)
where
y = \frac{1}{\sqrt{n}} \;\;\;\;\;\; (21)
The other components are computed as:
\tilde{a}_i = \frac{\tilde{m}_i}{\sqrt{\eta}} \;\;\;\;\;\; (22)
for i = 2, \ldots , n-1 if n \le 5, or i = 3, \ldots, n-2 if n > 5, where
\eta = \frac{\underline{\tilde{m}}^T \underline{\tilde{m}} - 2 \tilde{m}_n^2}{1 - 2 \tilde{a}_n^2} \;\;\;\;\;\; (23)
if n \le 5, and
\eta = \frac{\underline{\tilde{m}}^T \underline{\tilde{m}} - 2 \tilde{m}_n^2 - 2 \tilde{m}_{n-1}^2}{1 - 2 \tilde{a}_n^2 - 2 \tilde{a}_{n-1}^2} \;\;\;\;\;\; (24)
if n > 5.
Royston (1992a) found his approximation to \underline{a} to be accurate to at least \pm 1 in the third decimal place over all values of i and selected values of n, and also found that critical percentage points of W based on his approximation agreed closely with the exact critical percentage points calculated by Verrill and Johnson (1988).
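A minimal sketch of Equations (19)-(24) in base R follows, for the n > 5 branch only. It assumes the polynomial argument is y = 1/sqrt(n), as in Royston's published algorithm; the function name royston_coef is hypothetical and this is not the code used by gofTest. Because the R function shapiro.test is based on Royston's algorithm, its W statistic should be essentially the same as the one computed here.

```r
# Hypothetical helper: Royston's approximate Shapiro-Wilk coefficients (n > 5 only)
royston_coef <- function(n) {
  stopifnot(n >= 6)
  m <- qnorm((seq_len(n) - 3/8) / (n + 1/4))   # Blom scores, Equation (18)
  c_vec <- m / sqrt(sum(m^2))                  # Equation (17)
  y <- 1 / sqrt(n)                             # assumed polynomial argument
  a_n <- c_vec[n] + 0.221157 * y - 0.147981 * y^2 - 2.071190 * y^3 +
    4.434685 * y^4 - 2.706056 * y^5            # Equation (19)
  a_n1 <- c_vec[n - 1] + 0.042981 * y - 0.293762 * y^2 - 1.752461 * y^3 +
    5.682633 * y^4 - 3.582633 * y^5            # Equation (20)
  eta <- (sum(m^2) - 2 * m[n]^2 - 2 * m[n - 1]^2) /
    (1 - 2 * a_n^2 - 2 * a_n1^2)               # Equation (24)
  a <- m / sqrt(eta)                           # Equation (22), interior components
  a[c(n, n - 1)] <- c(a_n, a_n1)               # last two components
  a[c(1, 2)] <- -c(a_n, a_n1)                  # first two, by antisymmetry (Equation 4)
  a
}

set.seed(42)
x <- sort(rnorm(30))
a <- royston_coef(length(x))
W <- sum(a * x)^2 / sum((x - mean(x))^2)       # Equation (1)
W_ref <- unname(shapiro.test(x)$statistic)     # reference value from base R
```

By construction the coefficients satisfy Equation (3), so no further normalization is needed before computing W.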
Transformation of the Null Distribution of W to Normality
In order to compute a p-value associated with a particular value of W, Royston (1992a) approximated the distribution of (1-W) by a three-parameter lognormal distribution for 4 \le n \le 11, and the upper half of the distribution of (1-W) by a two-parameter lognormal distribution for 12 \le n \le 2000. Setting
z = \frac{w - \mu}{\sigma} \;\;\;\;\;\; (25)
the p-value associated with W is given by:
p = 1 - \Phi(z) \;\;\;\;\;\; (26)
For 4 \le n \le 11, the quantities necessary to compute z are given by:
w = -log[\gamma - log(1 - W)] \;\;\;\;\;\; (27)
\gamma = -2.273 + 0.459 n \;\;\;\;\;\; (28)
\mu = 0.5440 - 0.39978 n + 0.025054 n^2 - 0.000671 n^3 \;\;\;\;\;\; (29)
\sigma = exp(1.3822 - 0.77857 n + 0.062767 n^2 - 0.0020322 n^3) \;\;\;\;\;\; (30)
For 12 \le n \le 2000, the quantities necessary to compute z are given by:
w = log(1 - W) \;\;\;\;\;\; (31)
\gamma = log(n) \;\;\;\;\;\; (32)
\mu = -1.5861 - 0.31082 \gamma - 0.083751 \gamma^2 + 0.0038915 \gamma^3 \;\;\;\;\;\; (33)
\sigma = exp(-0.4803 - 0.082676 \gamma + 0.0030302 \gamma^2) \;\;\;\;\;\; (34)
For the last approximation when 12 \le n \le 2000, Royston (1992a) claims this approximation is actually valid for sample sizes up to n = 5000.
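The transformation for 12 \le n \le 2000 can be sketched as follows. This is an illustration under stated assumptions, not the code used by gofTest: the function name royston_pvalue is hypothetical, the polynomials are taken to be in log(n), and the cubic coefficient of the mean is taken to be 0.0038915, which is the value for which the result reproduces the p-value reported by the R function shapiro.test.

```r
# Hypothetical helper: p-value for the Shapiro-Wilk W statistic when 12 <= n <= 2000
royston_pvalue <- function(W, n) {
  stopifnot(n >= 12, n <= 2000)
  w <- log(1 - W)                                   # Equation (31)
  g <- log(n)                                       # Equation (32)
  mu <- -1.5861 - 0.31082 * g - 0.083751 * g^2 +
    0.0038915 * g^3                                 # Equation (33), assumed coefficient
  sigma <- exp(-0.4803 - 0.082676 * g + 0.0030302 * g^2)  # Equation (34)
  z <- (w - mu) / sigma                             # Equation (25)
  pnorm(z, lower.tail = FALSE)                      # Equation (26)
}

set.seed(7)
x <- rnorm(40)
sw <- shapiro.test(x)                               # reference implementation in base R
p <- royston_pvalue(unname(sw$statistic), length(x))
```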
Modification for the Three-Parameter Lognormal Distribution
When distribution="lnorm3", the function gofTest assumes the vector \underline{x} is a random sample from a three-parameter lognormal distribution. It estimates the threshold parameter via the zero-skewness method (see elnorm3), and then performs the Shapiro-Wilk goodness-of-fit test for normality on log(x-\hat{\gamma}) where \hat{\gamma} is the estimated threshold parameter. Because the threshold parameter has to be estimated, however, the p-value associated with the computed z-statistic will tend to be conservative (larger than it should be under the null hypothesis). Royston (1992b) proposed the following transformation of the z-statistic:
z' = \frac{z - \mu_z}{\sigma_z} \;\;\;\;\;\; (35)
where for 5 \le n \le 11,
\mu_z = -3.8267 + 2.8242 u - 0.63673 u^2 - 0.020815 v \;\;\;\;\;\; (36)
\sigma_z = -4.9914 + 8.6724 u - 4.27905 u^2 + 0.70350 u^3 - 0.013431 v \;\;\;\;\;\; (37)
and for 12 \le n \le 2000,
\mu_z = -3.7796 + 2.4038 u - 0.6675 u^2 - 0.082863 u^3 - 0.0037935 u^4 - 0.027027 v - 0.0019887 vu \;\;\;\;\;\; (38)
\sigma_z = 2.1924 - 1.0957 u + 0.33737 u^2 - 0.043201 u^3 + 0.0019974 u^4 - 0.0053312 vu \;\;\;\;\;\; (39)
where
u = log(n) \;\;\;\;\;\; (40)
v = u (\hat{\sigma} - \hat{\sigma}^2) \;\;\;\;\;\; (41)
\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^n (y_i - \bar{y})^2 \;\;\;\;\;\; (42)
y_i = log(x_i - \hat{\gamma}) \;\;\;\;\;\; (43)
and \gamma denotes the threshold parameter. The p-value associated with this test is then given by:
p = 1 - \Phi(z') \;\;\;\;\;\; (44)
Testing Goodness-of-Fit for Any Continuous Distribution
The function gofTest extends the Shapiro-Wilk test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Wilk test:
1. Let \underline{x} = x_1, x_2, \ldots, x_n denote the vector of n ordered observations. Compute cumulative probabilities for each x_i based on the cumulative distribution function for the hypothesized distribution. That is, compute p_i = F(x_i, \hat{\theta}) where F(x, \theta) denotes the hypothesized cumulative distribution function with parameter(s) \theta, and \hat{\theta} denotes the estimated parameter(s).
2. Compute standard normal deviates based on the computed cumulative probabilities: y_i = \Phi^{-1}(p_i)
3. Perform the Shapiro-Wilk goodness-of-fit test on the y_i's.
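The three steps above can be sketched in base R for a hypothesized exponential distribution. This is only an illustration: the choice of the exponential distribution and the use of 1/mean(x) as the rate estimate are assumptions made here, and gofTest's own parameter estimation may differ.

```r
set.seed(11)
x <- sort(rexp(50, rate = 2))      # ordered observations

# Step 1: cumulative probabilities under the fitted hypothesized distribution
rate_hat <- 1 / mean(x)            # maximum likelihood estimate of the rate
p <- pexp(x, rate = rate_hat)

# Step 2: standard normal deviates
y <- qnorm(p)

# Step 3: Shapiro-Wilk test for normality applied to the transformed values
res <- shapiro.test(y)
res$statistic
res$p.value
```

A small p-value is evidence against the hypothesized distribution, just as in the ordinary test for normality.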
-
Shapiro-Francia Goodness-of-Fit Test (test="sf").
The Shapiro-Francia goodness-of-fit test (Shapiro and Francia, 1972; Weisberg and Bingham, 1975; Royston, 1992c) is also one of the most commonly used goodness-of-fit tests for normality. You can use it to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta). You can also use it to test the null hypothesis of any continuous distribution that is available (see the help file for Distribution.df). See the section Testing Goodness-of-Fit for Any Continuous Distribution above for an explanation of how this is done.
Royston's Transformation of the Shapiro-Francia W'-Statistic to Normality
Equation (13) above gives the formula for the Shapiro-Francia W'-statistic, and Equation (16) above gives the formula for the Weisberg-Bingham approximation to the W'-statistic (denoted \tilde{W}'). Royston (1992c) presents an algorithm to transform the \tilde{W}'-statistic so that its null distribution is approximately a standard normal. For 5 \le n \le 5000, Royston (1992c) approximates the distribution of (1-\tilde{W}') by a lognormal distribution. Setting
z = \frac{w-\mu}{\sigma} \;\;\;\;\;\; (45)
the p-value associated with \tilde{W}' is given by:
p = 1 - \Phi(z) \;\;\;\;\;\; (46)
The quantities necessary to compute z are given by:
w = log(1 - \tilde{W}') \;\;\;\;\;\; (47)
\nu = log(n) \;\;\;\;\;\; (48)
u = log(\nu) - \nu \;\;\;\;\;\; (49)
\mu = -1.2725 + 1.0521 u \;\;\;\;\;\; (50)
v = log(\nu) + \frac{2}{\nu} \;\;\;\;\;\; (51)
\sigma = 1.0308 - 0.26758 v \;\;\;\;\;\; (52)
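Equations (45)-(52) translate directly into base R. The sketch below is an illustration only; the function name shapiro_francia is hypothetical, and the statistic is computed as the squared correlation with the Blom scores (the Weisberg-Bingham form).

```r
# Hypothetical helper: Shapiro-Francia test via the Weisberg-Bingham statistic
shapiro_francia <- function(x) {
  x <- sort(x)
  n <- length(x)
  stopifnot(n >= 5, n <= 5000)
  m <- qnorm((seq_len(n) - 3/8) / (n + 1/4))  # Blom scores
  W <- cor(m, x)^2                            # Weisberg-Bingham statistic
  w <- log(1 - W)                             # Equation (47)
  nu <- log(n)                                # Equation (48)
  u <- log(nu) - nu                           # Equation (49)
  mu <- -1.2725 + 1.0521 * u                  # Equation (50)
  v <- log(nu) + 2 / nu                       # Equation (51)
  sigma <- 1.0308 - 0.26758 * v               # Equation (52)
  z <- (w - mu) / sigma                       # Equation (45)
  list(statistic = W, z = z, p.value = pnorm(z, lower.tail = FALSE))
}

set.seed(2)
res_norm <- shapiro_francia(rnorm(100))       # data consistent with normality
res_exp <- shapiro_francia(rexp(200))         # strongly skewed data
```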
Testing Goodness-of-Fit for Any Continuous Distribution
The function gofTest extends the Shapiro-Francia test to test for goodness-of-fit for any continuous distribution by using the idea of Chen and Balakrishnan (1995), who proposed a general purpose approximate goodness-of-fit test based on the Cramer-von Mises or Anderson-Darling goodness-of-fit tests for normality. The function gofTest modifies the approach of Chen and Balakrishnan (1995) by using the same first 2 steps, and then applying the Shapiro-Francia test:
1. Let \underline{x} = x_1, x_2, \ldots, x_n denote the vector of n ordered observations. Compute cumulative probabilities for each x_i based on the cumulative distribution function for the hypothesized distribution. That is, compute p_i = F(x_i, \hat{\theta}) where F(x, \theta) denotes the hypothesized cumulative distribution function with parameter(s) \theta, and \hat{\theta} denotes the estimated parameter(s).
2. Compute standard normal deviates based on the computed cumulative probabilities: y_i = \Phi^{-1}(p_i)
3. Perform the Shapiro-Francia goodness-of-fit test on the y_i's.
-
Probability Plot Correlation Coefficient (PPCC) Goodness-of-Fit Test (test="ppcc").
The PPCC goodness-of-fit test (Filliben, 1975; Looney and Gulledge, 1985) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta). You can also use it to test the null hypothesis of any continuous distribution that is available (see the help file for Distribution.df). The function gofTest computes the PPCC test statistic using Blom plotting positions.
Filliben (1975) proposed using the correlation coefficient r from a normal probability plot to perform a goodness-of-fit test for normality, and he provided a table of critical values for r under the null hypothesis for sample sizes between 3 and 100. Vogel (1986) provided an additional table for sample sizes between 100 and 10,000.
Looney and Gulledge (1985) investigated the characteristics of Filliben's probability plot correlation coefficient (PPCC) test using the plotting position formulas given in Filliben (1975), as well as three other plotting position formulas: Hazen plotting positions, Weibull plotting positions, and Blom plotting positions (see the help file for qqPlot for an explanation of these plotting positions). They concluded that the PPCC test based on Blom plotting positions performs slightly better than tests based on other plotting positions, and they provide a table of empirical percentage points for the distribution of r based on Blom plotting positions.
The function gofTest computes the PPCC test statistic r using Blom plotting positions. It can be shown that the square of this statistic is equivalent to the Weisberg-Bingham Approximation to the Shapiro-Francia W'-Test (Weisberg and Bingham, 1975; Royston, 1993). Thus the PPCC goodness-of-fit test is equivalent to the Shapiro-Francia goodness-of-fit test.
-
Anderson-Darling Goodness-of-Fit Test (test="ad").
The Anderson-Darling goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="ad", the function gofTest calls the function ad.test in the package nortest. Documentation from that package is as follows:
The Anderson-Darling test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:
A = -n - \frac{1}{n} \sum_{i=1}^n [2i - 1][ln(p_{(i)}) + ln(1 - p_{(n-i+1)})]
where p_{(i)} = \Phi([x_{(i)} - \bar{x}]/s). Here, \Phi is the cumulative distribution function of the standard normal distribution, and \bar{x} and s are the mean and standard deviation of the data values. The p-value is computed from the modified statistic Z = A (1.0 + 0.75/n + 2.25/n^2) according to Table 4.9 in Stephens (1986a).
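The statistic A and the modified statistic Z can be computed directly from the formulas above. The following base R sketch is an illustration only (the function name ad_normal is hypothetical); it does not reproduce the p-value lookup from Stephens' table.

```r
# Hypothetical helper: Anderson-Darling statistic for normality with estimated parameters
ad_normal <- function(x) {
  x <- sort(x)
  n <- length(x)
  p <- pnorm((x - mean(x)) / sd(x))     # p_(i) for the ordered data
  i <- seq_len(n)
  # rev(p)[i] is p_(n-i+1), so this is the sum in the formula divided by n
  A <- -n - mean((2 * i - 1) * (log(p) + log(1 - rev(p))))
  Z <- A * (1 + 0.75 / n + 2.25 / n^2)  # modified statistic
  c(A = A, Z = Z)
}

set.seed(9)
ad_norm <- ad_normal(rnorm(100))        # data consistent with normality
ad_skew <- ad_normal(rexp(100))         # strongly skewed data
```

Larger values of A indicate a poorer fit, so the skewed sample should give the larger statistic.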
-
Cramer-von Mises Goodness-of-Fit Test (test="cvm").
The Cramer-von Mises goodness-of-fit test (Stephens, 1986a; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="cvm", the function gofTest calls the function cvm.test in the package nortest. Documentation from that package is as follows:
The Cramer-von Mises test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is:
W = \frac{1}{12n} + \sum_{i=1}^n \left(p_{(i)} - \frac{2i-1}{2n}\right)^2
where p_{(i)} = \Phi([x_{(i)} - \bar{x}]/s). Here, \Phi is the cumulative distribution function of the standard normal distribution, and \bar{x} and s are the mean and standard deviation of the data values. The p-value is computed from the modified statistic Z = W (1.0 + 0.5/n) according to Table 4.9 in Stephens (1986a).
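A base R sketch of the Cramer-von Mises statistic follows. It is an illustration only: the function name cvm_normal is hypothetical, and the modified statistic uses the adjustment factor (1 + 0.5/n), which is assumed here to be the one applied by cvm.test in nortest.

```r
# Hypothetical helper: Cramer-von Mises statistic for normality with estimated parameters
cvm_normal <- function(x) {
  x <- sort(x)
  n <- length(x)
  p <- pnorm((x - mean(x)) / sd(x))                       # p_(i) for the ordered data
  W <- 1 / (12 * n) + sum((p - (2 * seq_len(n) - 1) / (2 * n))^2)
  Z <- W * (1 + 0.5 / n)                                  # modified statistic (assumed factor)
  c(W = W, Z = Z)
}

set.seed(13)
x <- rnorm(60)
cvm <- cvm_normal(x)
```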
-
Lilliefors Goodness-of-Fit Test (test="lillie").
The Lilliefors goodness-of-fit test (Stephens, 1974; Dallal and Wilkinson, 1986; Thode, 2002) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="lillie", the function gofTest calls the function lillie.test in the package nortest. Documentation from that package is as follows:
The Lilliefors (Kolmogorov-Smirnov) test is an EDF omnibus test for the composite hypothesis of normality. The test statistic is the maximal absolute difference between empirical and hypothetical cumulative distribution function. It may be computed as D = max\{D^+, D^-\} with
D^+ = \max_{i = 1, \ldots, n} \{i/n - p_{(i)}\}, \;\; D^- = \max_{i = 1, \ldots, n} \{p_{(i)} - (i-1)/n\}
where p_{(i)} = \Phi([x_{(i)} - \bar{x}]/s). Here, \Phi is the cumulative distribution function of the standard normal distribution, and \bar{x} and s are the mean and standard deviation of the data values. The p-value is computed from the Dallal-Wilkinson (1986) formula, which is claimed to be only reliable when the p-value is smaller than 0.1. If the Dallal-Wilkinson p-value turns out to be greater than 0.1, then the p-value is computed from the distribution of the modified statistic Z = D (\sqrt{n} - 0.01 + 0.85/\sqrt{n}), see Stephens (1974), the actual p-value formula being obtained by a simulation and approximation process.
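The statistic D is the ordinary Kolmogorov-Smirnov distance evaluated at the fitted normal distribution, which can be checked in base R. The sketch below is an illustration only; it computes D and Z but not the Lilliefors p-value, and the p-value printed by ks.test is not appropriate here because the parameters are estimated.

```r
set.seed(3)
x <- sort(rnorm(30, mean = 5, sd = 2))
n <- length(x)
p <- pnorm((x - mean(x)) / sd(x))             # p_(i) for the ordered data

D_plus <- max(seq_len(n) / n - p)
D_minus <- max(p - (seq_len(n) - 1) / n)
D <- max(D_plus, D_minus)

# Same distance from ks.test with the estimated mean and standard deviation
D_ks <- unname(ks.test(x, "pnorm", mean = mean(x), sd = sd(x))$statistic)

# Modified statistic used when the Dallal-Wilkinson p-value exceeds 0.1
Z <- D * (sqrt(n) - 0.01 + 0.85 / sqrt(n))
```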
-
Zero-Skew Goodness-of-Fit Test (test="skew").
The Zero-skew goodness-of-fit test (D'Agostino, 1970) can be used to test the following hypothesized distributions: Normal, Lognormal, Zero-Modified Normal, or Zero-Modified Lognormal (Delta).
When test="skew", the function gofTest tests the null hypothesis that the skew of the distribution is 0:
H_0: \sqrt{\beta_1} = 0 \;\;\;\;\;\; (53)
where
\sqrt{\beta_1} = \frac{\mu_3}{\mu_2^{3/2}} \;\;\;\;\;\; (54)
and the quantity \mu_r denotes the r'th moment about the mean (also called the r'th central moment). The quantity \sqrt{\beta_1} is called the coefficient of skewness, and is estimated by:
\sqrt{b_1} = \frac{m_3}{m_2^{3/2}} \;\;\;\;\;\; (55)
where
m_r = \frac{1}{n} \sum_{i=1}^n (x_i - \bar{x})^r \;\;\;\;\;\; (56)
denotes the r'th sample central moment.
The possible alternative hypotheses are:
H_a: \sqrt{\beta_1} \ne 0 \;\;\;\;\;\; (57)
H_a: \sqrt{\beta_1} < 0 \;\;\;\;\;\; (58)
H_a: \sqrt{\beta_1} > 0 \;\;\;\;\;\; (59)
which correspond to alternative="two.sided", alternative="less", and alternative="greater", respectively.
To test the null hypothesis of zero skew, D'Agostino (1970) derived an approximation to the distribution of \sqrt{b_1} under the null hypothesis of zero-skew, assuming the observations comprise a random sample from a normal (Gaussian) distribution. Based on D'Agostino's approximation, the statistic Z shown below is assumed to follow a standard normal distribution and is used to compute the p-value associated with the test of H_0:
Z = \delta \;\; log\{ \frac{Y}{\alpha} + [(\frac{Y}{\alpha})^2 + 1]^{1/2} \} \;\;\;\;\;\; (60)
where
Y = \sqrt{b_1} [\frac{(n+1)(n+3)}{6(n-2)}]^{1/2} \;\;\;\;\;\; (61)
\beta_2 = \frac{3(n^2 + 27n - 70)(n+1)(n+3)}{(n-2)(n+5)(n+7)(n+9)} \;\;\;\;\;\; (62)
W^2 = -1 + \sqrt{2\beta_2 - 2} \;\;\;\;\;\; (63)
\delta = 1 / \sqrt{log(W)} \;\;\;\;\;\; (64)
\alpha = [2 / (W^2 - 1)]^{1/2} \;\;\;\;\;\; (65)
When the sample size n is at least 150, a simpler approximation may be used in which Y in Equation (61) is assumed to follow a standard normal distribution and is used to compute the p-value associated with the hypothesis test.
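Equations (55)-(65) can be sketched in base R as follows. This is an illustration only, not the code used by gofTest; the function name zero_skew_test and the names of the returned components are hypothetical.

```r
# Hypothetical helper: D'Agostino's zero-skew test via Equations (55)-(65)
zero_skew_test <- function(x, alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  n <- length(x)
  d <- x - mean(x)
  m2 <- mean(d^2)                                  # second sample central moment
  m3 <- mean(d^3)                                  # third sample central moment
  sqrt_b1 <- m3 / m2^(3/2)                         # Equation (55)
  Y <- sqrt_b1 * sqrt((n + 1) * (n + 3) / (6 * (n - 2)))        # Equation (61)
  beta2 <- 3 * (n^2 + 27 * n - 70) * (n + 1) * (n + 3) /
    ((n - 2) * (n + 5) * (n + 7) * (n + 9))        # Equation (62)
  W2 <- -1 + sqrt(2 * beta2 - 2)                   # W^2, Equation (63)
  delta <- 1 / sqrt(log(sqrt(W2)))                 # Equation (64)
  alpha <- sqrt(2 / (W2 - 1))                      # Equation (65)
  Z <- delta * log(Y / alpha + sqrt((Y / alpha)^2 + 1))         # Equation (60)
  p <- switch(alternative,
    two.sided = 2 * pnorm(abs(Z), lower.tail = FALSE),
    less = pnorm(Z),
    greater = pnorm(Z, lower.tail = FALSE))
  list(skew = sqrt_b1, Z = Z, p.value = p)
}

# A perfectly symmetric sample has zero sample skew, so Z is 0
res_sym <- zero_skew_test(c(-5:-1, 1:5))

# A right-skewed sample gives positive skew and a small upper-tail p-value
set.seed(21)
res_skew <- zero_skew_test(rexp(100), alternative = "greater")
```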
-
Kolmogorov-Smirnov Goodness-of-Fit Test (test="ks").
When test="ks", the function gofTest calls the R function ks.test to compute the test statistic and p-value. Note that for the one-sample case, the distribution parameters should be pre-specified and not estimated from the data. If the distribution parameters are estimated from the data, you will receive a warning that this test is very conservative (Type I error smaller than assumed; high Type II error) in this case.
-
ProUCL Kolmogorov-Smirnov Goodness-of-Fit Test for Gamma (test="proucl.ks.gamma").
When test="proucl.ks.gamma", the function gofTest calls the R function ks.test to compute the Kolmogorov-Smirnov test statistic based on the maximum likelihood estimates of the shape and scale parameters (see egamma). The p-value is computed based on the simulated critical values given in ProUCL.Crit.Vals.for.KS.Test.for.Gamma.array (USEPA, 2015). The sample size must be between 5 and 1000, and the value of the maximum likelihood estimate of the shape parameter must be between 0.025 and 50. The critical value for the test statistic is computed using the simulated critical values and linear interpolation.
-
ProUCL Anderson-Darling Goodness-of-Fit Test for Gamma (test="proucl.ad.gamma").
When test="proucl.ad.gamma", the function gofTest computes the Anderson-Darling test statistic (Stephens, 1986a, p.101) based on the maximum likelihood estimates of the shape and scale parameters (see egamma). The p-value is computed based on the simulated critical values given in ProUCL.Crit.Vals.for.AD.Test.for.Gamma.array (USEPA, 2015). The sample size must be between 5 and 1000, and the value of the maximum likelihood estimate of the shape parameter must be between 0.025 and 50. The critical value for the test statistic is computed using the simulated critical values and linear interpolation.
-
Chi-Squared Goodness-of-Fit Test (test="chisq").
The method used by gofTest is a modification of what is used for chisq.test. If the hypothesized distribution function is completely specified, the degrees of freedom are m-1 where m denotes the number of classes. If any parameters are estimated, the degrees of freedom depend on the method of estimation. The function gofTest follows the convention of computing degrees of freedom as m-1-k, where k is the number of parameters estimated. It can be shown that if the parameters are estimated by maximum likelihood, the degrees of freedom are bounded between m-1 and m-1-k. Therefore, especially when the sample size is small, it is important to compare the test statistic to the chi-squared distribution with both m-1 and m-1-k degrees of freedom. See Kendall and Stuart (1991, Chapter 30) for a more complete discussion.
The distribution theory of chi-square statistics is a large sample theory. The expected cell counts are assumed to be at least moderately large. As a rule of thumb, each should be at least 5. Although authors have found this rule to be conservative (especially when the class probabilities are not too different from each other), the user should regard p-values with caution when expected cell counts are small.
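The comparison against both sets of degrees of freedom can be sketched in base R. This is an illustration only: the choice of 10 equal-probability classes and the use of mean(x) and sd(x) as the parameter estimates are assumptions made here, and gofTest's defaults for n.classes and cut.points may differ.

```r
set.seed(5)
x <- rnorm(200, mean = 50, sd = 10)
n <- length(x)
m <- 10                                   # number of classes (assumed)
k <- 2                                    # parameters estimated: mean and sd

# Interior cut points giving equal expected probability under the fitted normal
mu_hat <- mean(x)
sd_hat <- sd(x)
inner_cuts <- qnorm(seq_len(m - 1) / m, mean = mu_hat, sd = sd_hat)

observed <- tabulate(findInterval(x, inner_cuts) + 1, nbins = m)
expected <- rep(n / m, m)                 # 20 per class, above the rule of thumb of 5
stat <- sum((observed - expected)^2 / expected)

# P-values under the two bounding degrees of freedom
p_df_upper <- pchisq(stat, df = m - 1, lower.tail = FALSE)
p_df_lower <- pchisq(stat, df = m - 1 - k, lower.tail = FALSE)
```

For a given statistic, the p-value with m-1-k degrees of freedom is never larger than the p-value with m-1 degrees of freedom, so the two values bracket the result.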
-
Wilk-Shapiro Goodness-of-Fit Test for Uniform [0, 1] Distribution (test="ws").
Wilk and Shapiro (1968) suggested this test in the context of jointly testing several independent samples for normality simultaneously. If p_1, p_2, \ldots, p_n denote the p-values associated with the test for normality of n independent samples, then under the null hypothesis that all n samples come from a normal distribution, the p-values are a random sample of n observations from a Uniform [0,1] distribution, that is, a Uniform distribution with minimum 0 and maximum 1. Wilk and Shapiro (1968) suggested two different methods for testing whether the p-values come from a Uniform [0, 1] distribution:
-
Test Based on Normal Scores. Under the null hypothesis, the normal scores
\Phi^{-1}(p_1), \Phi^{-1}(p_2), \ldots, \Phi^{-1}(p_n)
are a random sample of n observations from a standard normal distribution. Wilk and Shapiro (1968) denote the i'th normal score by
G_i = \Phi^{-1}(p_i) \;\;\;\;\;\; (66)
and note that under the null hypothesis, the quantity G defined as
G = \frac{1}{\sqrt{n}} \, \sum^n_{1}{G_i} \;\;\;\;\;\; (67)
has a standard normal distribution. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the n independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic G, this alternative hypothesis would tend to make G smaller than expected, so the p-value is given by \Phi(G). For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by
p = 1 - \Phi(G) \;\;\;\;\;\; (68)
-
Test Based on Chi-Square Scores. Under the null hypothesis, the chi-square scores
-2 \, log(p_1), -2 \, log(p_2), \ldots, -2 \, log(p_n)
are a random sample of n observations from a chi-square distribution with 2 degrees of freedom (Fisher, 1950). Wilk and Shapiro (1968) denote the i'th chi-square score by
C_i = -2 \, log(p_i) \;\;\;\;\;\; (69)
and note that under the null hypothesis, the quantity C defined as
C = \sum^n_{1}{C_i} \;\;\;\;\;\; (70)
has a chi-square distribution with 2n degrees of freedom. Wilk and Shapiro (1968) were interested in the alternative hypothesis that some of the n independent samples did not come from a normal distribution and hence would be associated with smaller p-values than expected under the null hypothesis, which translates to the alternative that the cdf for the distribution of the p-values is greater than the cdf of a Uniform [0, 1] distribution (alternative="greater"). In terms of the test statistic C, this alternative hypothesis would tend to make C larger than expected, so the p-value is given by
p = 1 - F_{2n}(C) \;\;\;\;\;\; (71)
where F_{2n} denotes the cumulative distribution function of the chi-square distribution with 2n degrees of freedom. For the one-sided lower alternative that the cdf for the distribution of p-values is less than the cdf for a Uniform [0, 1] distribution, the p-value is given by
p = F_{2n}(C) \;\;\;\;\;\; (72)
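Both Wilk-Shapiro methods reduce to a few lines of base R. The sketch below is an illustration of Equations (66)-(72) only; the function name wilk_shapiro_uniform, the method labels, and the names of the returned components are hypothetical and are not claimed to match gofTest.

```r
# Hypothetical helper: Wilk-Shapiro tests that p-values are Uniform [0, 1]
wilk_shapiro_uniform <- function(p, method = c("normal scores", "chi-square scores")) {
  method <- match.arg(method)
  n <- length(p)
  if (method == "normal scores") {
    G <- sum(qnorm(p)) / sqrt(n)                 # Equations (66)-(67)
    list(statistic = G,
         p.greater = pnorm(G),                   # small p-values make G small
         p.less = pnorm(G, lower.tail = FALSE))  # Equation (68)
  } else {
    C <- sum(-2 * log(p))                        # Equations (69)-(70)
    list(statistic = C,
         p.greater = pchisq(C, df = 2 * n, lower.tail = FALSE),  # Equation (71)
         p.less = pchisq(C, df = 2 * n))                         # Equation (72)
  }
}

# P-values that are all 0.5 give G = 0 and a p-value of 0.5
res_mid <- wilk_shapiro_uniform(rep(0.5, 8))

# Uniformly small p-values are strong evidence against joint normality
small_p <- c(0.01, 0.02, 0.03, 0.04)
res_ns <- wilk_shapiro_uniform(small_p, method = "normal scores")
res_cs <- wilk_shapiro_uniform(small_p, method = "chi-square scores")
```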
Value
a list of class "gof" containing the results of the goodness-of-fit test, unless the two-sample Kolmogorov-Smirnov test is used, in which case the value is a list of class "gofTwoSample". Objects of class "gof" and "gofTwoSample" have special printing and plotting methods. See the help files for gof.object and gofTwoSample.object for details.
Note
The Shapiro-Wilk test (Shapiro and Wilk, 1965) and the Shapiro-Francia test (Shapiro and Francia, 1972) are probably the two most commonly used hypothesis tests to test departures from normality. The Shapiro-Wilk test is most powerful at detecting short-tailed (platykurtic) and skewed distributions, and least powerful against symmetric, moderately long-tailed (leptokurtic) distributions. Conversely, the Shapiro-Francia test is more powerful against symmetric long-tailed distributions and less powerful against short-tailed distributions (Royston, 1992b; 1993). In general, the Shapiro-Wilk and Shapiro-Francia tests outperform the Anderson-Darling test, which in turn outperforms the Cramer-von Mises test, which in turn outperforms the Lilliefors test (Stephens, 1986a; Razali and Wah, 2011; Romao et al., 2010).
The zero-skew goodness-of-fit test for normality is one of several tests that have
been proposed to test the assumption of a normal distribution (D'Agostino, 1986b).
This test has been included mainly because it is called by elnorm3.
Usually, the Shapiro-Wilk or Shapiro-Francia test is preferred to this test, unless
the direction of the alternative to normality (e.g., positive skew) is known
(D'Agostino, 1986b, pp. 405–406).
Kolmogorov (1933) introduced a goodness-of-fit test to test the hypothesis that a
random sample of n observations x comes from a specific hypothesized distribution
with cumulative distribution function H. This test is now usually called the
one-sample Kolmogorov-Smirnov goodness-of-fit test. Smirnov (1939) introduced a
goodness-of-fit test to test the hypothesis that a random sample of n
observations x comes from the same distribution as a random sample of
m observations y. This test is now usually called the two-sample
Kolmogorov-Smirnov goodness-of-fit test. Both tests are based on the maximum
vertical distance between two cumulative distribution functions. For the one-sample problem
with a small sample size, the Kolmogorov-Smirnov test may be preferred over the chi-squared
goodness-of-fit test since the KS-test is exact, while the chi-squared test is based on
an asymptotic approximation.
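The "maximum vertical distance" can be computed by hand and compared with ks.test from the stats package. The following is only a sketch with simulated data; the variable names and the standard normal hypothesized distribution are chosen here for illustration.

```r
set.seed(1)
x <- rnorm(15)            # first sample (simulated, illustrative)
y <- rnorm(10, mean = 1)  # second sample (simulated, illustrative)

# One-sample case: distance between the empirical cdf of x and the
# hypothesized cdf H (here assumed to be the standard normal cdf).
n <- length(x)
H <- pnorm(sort(x))
# The empirical cdf jumps at each observation, so the gap is checked
# just after each jump (i/n) and just before it ((i - 1)/n).
D.one <- max(seq_len(n) / n - H, H - (seq_len(n) - 1) / n)

# Two-sample case: distance between the two empirical cdfs,
# evaluated at every observed value.
grid <- sort(c(x, y))
D.two <- max(abs(ecdf(x)(grid) - ecdf(y)(grid)))

# Both hand-computed statistics should agree with ks.test.
all.equal(D.one, unname(ks.test(x, "pnorm")$statistic))
all.equal(D.two, unname(ks.test(x, y)$statistic))
```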
The chi-squared test, introduced by Pearson in 1900, is the oldest and best known goodness-of-fit test. The idea is to reduce the goodness-of-fit problem to a multinomial setting by comparing the observed cell counts with their expected values under the null hypothesis. Grouping the data sacrifices information, especially if the hypothesized distribution is continuous. On the other hand, chi-squared tests can be applied to any type of variable: continuous, discrete, or a combination of these.
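The reduction to a multinomial setting can be sketched in base R as follows. This is an illustration under assumptions, not the internal code of gofTest: the data are simulated, the hypothesized distribution is a fully specified normal distribution, and the cells are chosen to be equiprobable under the null hypothesis.

```r
set.seed(2)
x <- rnorm(40, mean = 10, sd = 2)  # simulated data (illustrative)
k <- 4                             # number of cells (classes)

# Cut points at quantiles of the hypothesized N(10, 2) distribution,
# so that each cell has probability 1/k under the null hypothesis.
cut.points <- qnorm(seq(0, 1, length.out = k + 1), mean = 10, sd = 2)

# Observed and expected cell counts.
observed <- as.vector(table(cut(x, breaks = cut.points)))
expected <- rep(length(x) / k, k)

# Pearson's statistic; no parameters are estimated in this sketch,
# so the reference distribution has k - 1 degrees of freedom.
chisq <- sum((observed - expected)^2 / expected)
p.value <- 1 - pchisq(chisq, df = k - 1)

c(chisq = chisq, p.value = p.value)
```

When parameters are estimated from the data, the degrees of freedom are reduced accordingly, which is why the gamma example in the Examples section reports df = 1 for four classes and two estimated parameters.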
The Wilk-Shapiro (1968) tests for a Uniform [0, 1] distribution were introduced in the context
of testing whether several independent samples all come from normal distributions, with
possibly different means and variances. The function gofGroupTest extends
this idea to allow you to test whether several independent samples come from the same
distribution (e.g., gamma, extreme value, etc.), with possibly different parameters.
In practice, almost any goodness-of-fit test will fail to reject the null hypothesis
if the number of observations is relatively small. Conversely, almost any goodness-of-fit
test will reject the null hypothesis if the number of observations is very large,
since “real” data are never distributed according to any theoretical distribution
(Conover, 1980, p. 367). For most cases, however, the distribution of “real” data
is close enough to some theoretical distribution that fairly accurate results may be
provided by assuming that particular theoretical distribution. One way to assess the
goodness of the fit is to use goodness-of-fit tests. Another way is to look at
quantile-quantile (Q-Q) plots (see qqPlot).
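One way to explore the effect of sample size is a small simulation such as the sketch below, which uses only base R. The choice of a t distribution with 8 degrees of freedom as the "real" (slightly non-normal) distribution is an assumption made for illustration, and the resulting p-values depend on the random draw.

```r
set.seed(3)

# Simulated "real" data: close to normal, but with slightly heavier
# tails (illustrative assumption).
big <- rt(5000, df = 8)   # shapiro.test accepts at most 5000 values
small <- big[1:20]        # a small subsample of the same data

# Shapiro-Wilk p-values for the small and the large sample.
p.small <- shapiro.test(small)$p.value
p.big <- shapiro.test(big)$p.value
c(p.small = p.small, p.big = p.big)

# A normal Q-Q plot shows how far the large sample departs from
# normality, independent of any p-value.
qqnorm(big)
qqline(big)
```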
Author(s)
Steven P. Millard (EnvStats@ProbStatInfo.com)
Juergen Gross and Uwe Ligges for the Anderson-Darling, Cramer-von Mises, and Lilliefors tests called from the package nortest.
References
Birnbaum, Z.W., and F.H. Tingey. (1951). One-Sided Confidence Contours for Probability Distribution Functions. Annals of Mathematical Statistics 22, 592-596.
Blom, G. (1958). Statistical Estimates and Transformed Beta Variables. John Wiley and Sons, New York.
Conover, W.J. (1980). Practical Nonparametric Statistics. Second Edition. John Wiley and Sons, New York.
Dallal, G.E., and L. Wilkinson. (1986). An Analytic Approximation to the Distribution of Lilliefor's Test for Normality. The American Statistician 40, 294-296.
D'Agostino, R.B. (1970). Transformation to Normality of the Null Distribution of g1. Biometrika 57, 679-681.
D'Agostino, R.B. (1971). An Omnibus Test of Normality for Moderate and Large Size Samples. Biometrika 58, 341-348.
D'Agostino, R.B. (1986b). Tests for the Normal Distribution. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
D'Agostino, R.B., and E.S. Pearson (1973). Tests for Departures from Normality. Empirical Results for the Distributions of b2 and \sqrt{b1}. Biometrika 60(3), 613-622.
D'Agostino, R.B., and G.L. Tietjen (1973). Approaches to the Null Distribution of \sqrt{b1}. Biometrika 60(1), 169-173.
Fisher, R.A. (1950). Statistical Methods for Research Workers. 11th Edition. Hafner Publishing Company, New York, pp. 99-100.
Gibbons, R.D., D.K. Bhaumik, and S. Aryal. (2009). Statistical Methods for Groundwater Monitoring, Second Edition. John Wiley & Sons, Hoboken.
Kendall, M.G., and A. Stuart. (1991). The Advanced Theory of Statistics, Volume 2: Inference and Relationship. Fifth Edition. Oxford University Press, New York.
Kim, P.J., and R.I. Jennrich. (1973). Tables of the Exact Sampling Distribution of the Two Sample Kolmogorov-Smirnov Criterion. In Harter, H.L., and D.B. Owen, eds. Selected Tables in Mathematical Statistics, Vol. 1. American Mathematical Society, Providence, Rhode Island, pp.79-170.
Kolmogorov, A.N. (1933). Sulla determinazione empirica di una legge di distribuzione. Giornale dell' Istituto Italiano degli Attuari 4, 83-91.
Marsaglia, G., W.W. Tsang, and J. Wang. (2003). Evaluating Kolmogorov's distribution. Journal of Statistical Software, 8(18). doi:10.18637/jss.v008.i18.
Moore, D.S. (1986). Tests of Chi-Squared Type. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, pp. 63-95.
Pomeranz, J. (1973). Exact Cumulative Distribution of the Kolmogorov-Smirnov Statistic for Small Samples (Algorithm 487). Collected Algorithms from ACM ??, ???-???.
Razali, N.M., and Y.B. Wah. (2011). Power Comparisons of Shapiro-Wilk, Kolmogorov-Smirnov, Lilliefors, and Anderson-Darling Tests. Journal of Statistical Modeling and Analytics 2(1), 21–33.
Romao, X., Delgado, R., and A. Costa. (2010). An Empirical Power Comparison of Univariate Goodness-of-Fit Tests for Normality. Journal of Statistical Computation and Simulation 80(5), 545–591.
Royston, J.P. (1992a). Approximating the Shapiro-Wilk W-Test for Non-Normality. Statistics and Computing 2, 117-119.
Royston, J.P. (1992b). Estimation, Reference Ranges and Goodness of Fit for the Three-Parameter Log-Normal Distribution. Statistics in Medicine 11, 897-912.
Royston, J.P. (1992c). A Pocket-Calculator Algorithm for the Shapiro-Francia Test of Non-Normality: An Application to Medicine. Statistics in Medicine 12, 181-184.
Royston, P. (1993). A Toolkit for Testing for Non-Normality in Complete and Censored Samples. The Statistician 42, 37-43.
Ryan, T., and B. Joiner. (1973). Normal Probability Plots and Tests for Normality. Technical Report, Pennsylvania State University, Department of Statistics.
Shapiro, S.S., and R.S. Francia. (1972). An Approximate Analysis of Variance Test for Normality. Journal of the American Statistical Association 67(337), 215-219.
Shapiro, S.S., and M.B. Wilk. (1965). An Analysis of Variance Test for Normality (Complete Samples). Biometrika 52, 591-611.
Smirnov, N.V. (1939). Estimate of Deviation Between Empirical Distribution Functions in Two Independent Samples. Bulletin Moscow University 2(2), 3-16.
Smirnov, N.V. (1948). Table for Estimating the Goodness of Fit of Empirical Distributions. Annals of Mathematical Statistics 19, 279-281.
Stephens, M.A. (1970). Use of the Kolmogorov-Smirnov, Cramer-von Mises and Related Statistics Without Extensive Tables. Journal of the Royal Statistical Society, Series B, 32, 115-122.
Stephens, M.A. (1974). EDF Statistics for Goodness of Fit and Some Comparisons. Journal of the American Statistical Association 69, 730-737.
Stephens, M.A. (1986a). Tests Based on EDF Statistics. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York.
Thode Jr., H.C. (2002). Testing for Normality. Marcel Dekker, New York.
USEPA. (2015). ProUCL Version 5.1.002 Technical Guide. EPA/600/R-07/041, October 2015. Office of Research and Development. U.S. Environmental Protection Agency, Washington, D.C.
Verrill, S., and R.A. Johnson. (1987). The Asymptotic Equivalence of Some Modified Shapiro-Wilk Statistics – Complete and Censored Sample Cases. The Annals of Statistics 15(1), 413-419.
Verrill, S., and R.A. Johnson. (1988). Tables and Large-Sample Distribution Theory for Censored-Data Correlation Statistics for Testing Normality. Journal of the American Statistical Association 83, 1192-1197.
Weisberg, S., and C. Bingham. (1975). An Approximate Analysis of Variance Test for Non-Normality Suitable for Machine Calculation. Technometrics 17, 133-134.
Wilk, M.B., and S.S. Shapiro. (1968). The Joint Assessment of Normality of Several Independent Samples. Technometrics, 10(4), 825-839.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ.
See Also
rosnerTest, gof.object, print.gof, plot.gof, shapiro.test, ks.test, chisq.test,
Normal, Lognormal, Lognormal3,
Zero-Modified Normal, Zero-Modified Lognormal (Delta),
enorm, elnorm, elnormAlt, elnorm3, ezmnorm, ezmlnorm, ezmlnormAlt, qqPlot.
Examples
# Generate 20 observations from a gamma distribution with
# parameters shape = 2 and scale = 3 then run various
# goodness-of-fit tests.
# (Note: the call to set.seed lets you reproduce this example.)
set.seed(47)
dat <- rgamma(20, shape = 2, scale = 3)
# Shapiro-Wilk generalized goodness-of-fit test
#----------------------------------------------
gof.list <- gofTest(dat, distribution = "gamma")
gof.list
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Shapiro-Wilk GOF Based on
# Chen & Balakrisnan (1995)
#
#Hypothesized Distribution: Gamma
#
#Estimated Parameter(s): shape = 1.909462
# scale = 4.056819
#
#Estimation Method: mle
#
#Data: dat
#
#Sample Size: 20
#
#Test Statistic: W = 0.9834958
#
#Test Statistic Parameter: n = 20
#
#P-value: 0.970903
#
#Alternative Hypothesis: True cdf does not equal the
# Gamma Distribution.
dev.new()
plot(gof.list)
#----------
# Redo the example above, but use the bias-corrected mle
gofTest(dat, distribution = "gamma",
est.arg.list = list(method = "bcmle"))
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Shapiro-Wilk GOF Based on
# Chen & Balakrisnan (1995)
#
#Hypothesized Distribution: Gamma
#
#Estimated Parameter(s): shape = 1.656376
# scale = 4.676680
#
#Estimation Method: bcmle
#
#Data: dat
#
#Sample Size: 20
#
#Test Statistic: W = 0.9834346
#
#Test Statistic Parameter: n = 20
#
#P-value: 0.9704046
#
#Alternative Hypothesis: True cdf does not equal the
# Gamma Distribution.
#----------
# Kolmogorov-Smirnov goodness-of-fit test (pre-specified parameters)
#------------------------------------------------------------------
gofTest(dat, test = "ks", distribution = "gamma",
param.list = list(shape = 2, scale = 3))
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Kolmogorov-Smirnov GOF
#
#Hypothesized Distribution: Gamma(shape = 2, scale = 3)
#
#Data: dat
#
#Sample Size: 20
#
#Test Statistic: ks = 0.2313878
#
#Test Statistic Parameter: n = 20
#
#P-value: 0.2005083
#
#Alternative Hypothesis: True cdf does not equal the
# Gamma(shape = 2, scale = 3)
# Distribution.
#----------
# ProUCL Version of Kolmogorov-Smirnov goodness-of-fit test
# for a Gamma Distribution (estimated parameters)
#---------------------------------------------------------
gofTest(dat, test = "proucl.ks.gamma", distribution = "gamma")
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: ProUCL Kolmogorov-Smirnov Gamma GOF
#
#Hypothesized Distribution: Gamma
#
#Estimated Parameter(s): shape = 1.909462
# scale = 4.056819
#
#Estimation Method: MLE
#
#Data: dat
#
#Sample Size: 20
#
#Test Statistic: D = 0.0988692
#
#Test Statistic Parameter: n = 20
#
#Critical Values: D.0.01 = 0.228
# D.0.05 = 0.196
# D.0.10 = 0.180
#
#P-value: >= 0.10
#
#Alternative Hypothesis: True cdf does not equal the
# Gamma Distribution.
#----------
# Chi-squared goodness-of-fit test (estimated parameters)
#--------------------------------------------------------
gofTest(dat, test = "chisq", distribution = "gamma", n.classes = 4)
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Chi-square GOF
#
#Hypothesized Distribution: Gamma
#
#Estimated Parameter(s): shape = 1.909462
# scale = 4.056819
#
#Estimation Method: mle
#
#Data: dat
#
#Sample Size: 20
#
#Test Statistic: Chi-square = 1.2
#
#Test Statistic Parameter: df = 1
#
#P-value: 0.2733217
#
#Alternative Hypothesis: True cdf does not equal the
# Gamma Distribution.
#----------
# Clean up
rm(dat, gof.list)
graphics.off()
#--------------------------------------------------------------------
# Example 10-2 of USEPA (2009, page 10-14) gives an example of
# using the Shapiro-Wilk test to test the assumption of normality
# for nickel concentrations (ppb) in groundwater collected over
# 4 years. The data for this example are stored in
# EPA.09.Ex.10.1.nickel.df.
EPA.09.Ex.10.1.nickel.df
# Month Well Nickel.ppb
#1 1 Well.1 58.8
#2 3 Well.1 1.0
#3 6 Well.1 262.0
#4 8 Well.1 56.0
#5 10 Well.1 8.7
#6 1 Well.2 19.0
#7 3 Well.2 81.5
#8 6 Well.2 331.0
#9 8 Well.2 14.0
#10 10 Well.2 64.4
#11 1 Well.3 39.0
#12 3 Well.3 151.0
#13 6 Well.3 27.0
#14 8 Well.3 21.4
#15 10 Well.3 578.0
#16 1 Well.4 3.1
#17 3 Well.4 942.0
#18 6 Well.4 85.6
#19 8 Well.4 10.0
#20 10 Well.4 637.0
# Test for a normal distribution:
#--------------------------------
gof.list <- gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df)
gof.list
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Shapiro-Wilk GOF
#
#Hypothesized Distribution: Normal
#
#Estimated Parameter(s): mean = 169.5250
# sd = 259.7175
#
#Estimation Method: mvue
#
#Data: Nickel.ppb
#
#Data Source: EPA.09.Ex.10.1.nickel.df
#
#Sample Size: 20
#
#Test Statistic: W = 0.6788888
#
#Test Statistic Parameter: n = 20
#
#P-value: 2.17927e-05
#
#Alternative Hypothesis: True cdf does not equal the
# Normal Distribution.
dev.new()
plot(gof.list)
#----------
# Test for a lognormal distribution:
#-----------------------------------
gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df,
dist = "lnorm")
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Shapiro-Wilk GOF
#
#Hypothesized Distribution: Lognormal
#
#Estimated Parameter(s): meanlog = 3.918529
# sdlog = 1.801404
#
#Estimation Method: mvue
#
#Data: Nickel.ppb
#
#Data Source: EPA.09.Ex.10.1.nickel.df
#
#Sample Size: 20
#
#Test Statistic: W = 0.978946
#
#Test Statistic Parameter: n = 20
#
#P-value: 0.9197735
#
#Alternative Hypothesis: True cdf does not equal the
# Lognormal Distribution.
#----------
# Test for a lognormal distribution, but use the
# Mean and CV parameterization:
#-----------------------------------------------
gofTest(Nickel.ppb ~ 1, data = EPA.09.Ex.10.1.nickel.df,
dist = "lnormAlt")
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: Shapiro-Wilk GOF
#
#Hypothesized Distribution: Lognormal
#
#Estimated Parameter(s): mean = 213.415628
# cv = 2.809377
#
#Estimation Method: mvue
#
#Data: Nickel.ppb
#
#Data Source: EPA.09.Ex.10.1.nickel.df
#
#Sample Size: 20
#
#Test Statistic: W = 0.978946
#
#Test Statistic Parameter: n = 20
#
#P-value: 0.9197735
#
#Alternative Hypothesis: True cdf does not equal the
# Lognormal Distribution.
#----------
# Clean up
rm(gof.list)
graphics.off()
#---------------------------------------------------------------------------
# Generate 20 observations from a normal distribution with mean=3 and sd=2, and
# generate 10 observations from a normal distribution with mean=1 and sd=2 then
# test whether these sets of observations come from the same distribution.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(300)
dat1 <- rnorm(20, mean = 3, sd = 2)
dat2 <- rnorm(10, mean = 1, sd = 2)
gofTest(x = dat1, y = dat2, test = "ks")
#Results of Goodness-of-Fit Test
#-------------------------------
#
#Test Method: 2-Sample K-S GOF
#
#Hypothesized Distribution: Equal
#
#Data: x = dat1
# y = dat2
#
#Sample Sizes: n.x = 20
# n.y = 10
#
#Test Statistic: ks = 0.7
#
#Test Statistic Parameters: n = 20
# m = 10
#
#P-value: 0.001669561
#
#Alternative Hypothesis: The cdf of 'dat1' does not equal
# the cdf of 'dat2'.
#----------
# Clean up
rm(dat1, dat2)