R: Empirical Cumulative Distribution Function Plot

ecdfPlot {EnvStats}

R Documentation

Empirical Cumulative Distribution Function Plot

Description

Produce an empirical cumulative distribution function plot.

Usage

  ecdfPlot(x, discrete = FALSE, 
    prob.method = ifelse(discrete, "emp.probs", "plot.pos"), 
    plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black", 
    ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE, 
    curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"), 
    main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)

Arguments

`x`	numeric vector of observations. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`discrete`	logical scalar indicating whether the assumed parent distribution of `x` is discrete (`discrete=TRUE`) or continuous (`discrete=FALSE`; the default).
`prob.method`	character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are `plot.pos` (plotting positions, the default if `discrete=FALSE`) and `emp.probs` (empirical probabilities, the default if `discrete=TRUE`). See the DETAILS section for more explanation.
`plot.pos.con`	numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is `plot.pos.con=0.375`. See the DETAILS section for more information. This argument is ignored if `prob.method="emp.probs"`.
`plot.it`	logical scalar indicating whether to produce a plot or add to the current plot (see `add`) on the current graphics device. The default value is `plot.it=TRUE`.
`add`	logical scalar indicating whether to add the empirical cdf to the current plot (`add=TRUE`) or generate a new plot (`add=FALSE`; the default). This argument is ignored if `plot.it=FALSE`.
`ecdf.col`	a numeric scalar or character string determining the color of the empirical cdf line or points. The default value is `ecdf.col="black"`. See the entry for `col` in the help file for `par` for more information.
`ecdf.lwd`	a numeric scalar determining the width of the empirical cdf line. The default value is `ecdf.lwd=3*par("cex")`. See the entry for `lwd` in the help file for `par` for more information.
`ecdf.lty`	a numeric scalar determining the line type of the empirical cdf line. The default value is `ecdf.lty=1`. See the entry for `lty` in the help file for `par` for more information.
`curve.fill`	a logical scalar indicating whether to fill in the area below the empirical cdf curve with the color specified by `curve.fill.col`. The default value is `curve.fill=FALSE`.
`curve.fill.col`	a numeric scalar or character string indicating what color to use to fill in the area below the empirical cdf curve. The default value is `curve.fill.col="cyan"`. This argument is ignored if `curve.fill=FALSE`.
`type`, `main`, `xlab`, `ylab`, `xlim`, `ylim`, `...`	additional graphical parameters (see `lines` and `par`). In particular, the argument `type` specifies the type of plot to draw. By default, the function `ecdfPlot` plots a step function (`type="s"`) when `discrete=TRUE`, and plots a straight line between points (`type="l"`) when `discrete=FALSE`. The user may override these defaults by supplying the graphics parameter `type` (`type="s"` for a step function, `type="l"` for linear interpolation, `type="p"` for points only, etc.).

Details

The cumulative distribution function (cdf) of a random variable X is the function F such that

F(x) = Pr(X \le x) \;\;\;\;\;\; (1)

for all values of x. That is, if p = F(x), then p is the proportion of the population that is less than or equal to x, and x is called the p'th quantile, or the 100p'th percentile. A plot of quantiles on the x-axis (i.e., the possible values of the random variable X) vs. the fraction of the population less than or equal to that number on the y-axis is called the cumulative distribution function plot, and the y-axis is usually labeled as the “cumulative probability” or “cumulative frequency”.

When we have a sample of data from some population, we usually do not know what percentiles our observations correspond to because we do not know the form of the cumulative distribution function F, so we have to use the sample data to estimate the cdf F. An empirical cumulative distribution function (ecdf) plot, also called a quantile plot, is a plot of the observed quantiles (i.e., the ordered observations) on the x-axis vs. the estimated cumulative probabilities on the y-axis (Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20; Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24).

(Note: Some authors (e.g., Chambers et al., 1983, pp.11-16; Cleveland, 1993, pp.17-20) reverse the axes on a quantile plot, i.e., the observed order statistics from the random sample are on the y-axis and the estimated cumulative probabilities are on the x-axis.)

The empirical cumulative distribution function (ecdf) is an estimate of the cdf based on a random sample of n observations from the distribution. Let x_1, x_2, \ldots, x_n denote the n observations, and let x_{(1)}, x_{(2)}, \ldots, x_{(n)} denote the ordered observations (i.e., the order statistics). The cdf is usually estimated by either the empirical probabilities estimator or the plotting-position estimator. The empirical probabilities estimator is given by:

\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (2)

where \#[x_j \le x_{(i)}] denotes the number of observations less than or equal to x_{(i)}. The plotting-position estimator is given by:

\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (3)

where 0 \le a \le 1 (Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8,25).
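
As a rough illustration of Equations (2) and (3), the following base R sketch computes both estimators for a small sample. The data values and variable names are invented for illustration, and this is not meant to reproduce the internal code of ecdfPlot. The base R function ppoints() evaluates the same expression as Equation (3).

```r
# Hypothetical sample (values invented for illustration)
x <- c(2.1, 0.4, 3.7, 1.5, 2.9)
n <- length(x)
x.sorted <- sort(x)

# Equation (2): proportion of observations <= each order statistic
emp.probs <- sapply(x.sorted, function(v) sum(x <= v)) / n

# Equation (3): plotting positions with constant a
a <- 0.375
plot.pos <- (seq_len(n) - a) / (n - 2 * a + 1)

# ppoints() in base R evaluates the same expression
stopifnot(isTRUE(all.equal(plot.pos, ppoints(n, a = a))))

cbind(x.sorted, emp.probs, plot.pos)
```

Note that the last empirical probability equals 1, whereas the last plotting position is strictly less than 1.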

For any value x such that x_{(1)} < x < x_{(n)}, the ecdf is usually defined either as a step function:

\hat{F}(x) = \hat{F}[x_{(i)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (4)

(e.g., D'Agostino, 1986a), or by linear interpolation between order statistics:

\hat{F}(x) = (1-r)\hat{F}[x_{(i)}] + r\hat{F}[x_{(i+1)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (5)

where

r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}} \;\;\;\;\;\; (6)

(e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a value on the x-axis corresponding to one of the order statistics, then it makes a jump. For the linear interpolation version, the ecdf plot looks like lines connecting the points. By default, the function ecdfPlot uses the step function version when discrete=TRUE, and the linear interpolation version when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).
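
The two definitions in Equations (4) through (6) can be sketched in base R as follows. The three points below are hypothetical, and approx() from the stats package is used here only as a convenient stand-in; it is an assumption of this sketch, not a statement about how ecdfPlot draws its curves.

```r
# Hypothetical order statistics and estimated probabilities
x.sorted <- c(1, 2, 4)
p.hat    <- c(0.2, 0.5, 0.8)

x.new <- 3  # a point strictly between x_(2) and x_(3)

# Equation (4): step function, hold the value at the left order statistic
step.val <- approx(x.sorted, p.hat, xout = x.new,
                   method = "constant", f = 0)$y

# Equations (5)-(6): linear interpolation between neighbouring points
r <- (x.new - x.sorted[2]) / (x.sorted[3] - x.sorted[2])
lin.val <- (1 - r) * p.hat[2] + r * p.hat[3]

# approx() with method = "linear" gives the same value
stopifnot(isTRUE(all.equal(lin.val,
  approx(x.sorted, p.hat, xout = x.new, method = "linear")$y)))

c(step = step.val, linear = lin.val)
```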

The empirical probabilities estimator is intuitively appealing. This is the estimator used when prob.method="emp.probs". The disadvantage of this estimator is that it implies the largest observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This may be satisfactory if the underlying distribution is known to be discrete, but it is usually not satisfactory if the underlying distribution is known to be continuous.

The plotting-position estimator with various values of a is often used when the goal is to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used to compute the estimated expected values or medians of the order statistics for a probability plot. This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers to the variable a. Based on certain principles from statistical theory, certain values of the constant a make sense for specific underlying distributions (see the help file for qqPlot for more information).
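
To see how the constant a in Equation (3) affects the result, the sketch below evaluates the plotting positions for three values of a with n = 10. The choice of n and of the three constants is arbitrary and only for illustration. Algebraically, a = 0 reduces Equation (3) to i/(n+1) and a = 0.5 reduces it to (i - 0.5)/n.

```r
n <- 10
i <- seq_len(n)

# Plotting positions from Equation (3) for three choices of a
plot.pos <- sapply(c("a=0" = 0, "a=0.375" = 0.375, "a=0.5" = 0.5),
                   function(a) (i - a) / (n - 2 * a + 1))

# Smallest and largest plotting position for each constant
round(plot.pos[c(1, n), ], 3)
```

For every a the positions are symmetric about 0.5, so the smallest and largest values always sum to 1.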

Because x is a random sample, the empirical cdf changes from sample to sample, and the variability in these estimates can be dramatic for small sample sizes.
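
A small simulation can give a feel for this variability. The sketch below is an illustration under assumed settings (standard normal data, sample sizes 10 and 200, 500 replicates, an arbitrary seed); it estimates F(0) with the empirical probabilities estimator in each simulated sample and compares the spread of the estimates.

```r
set.seed(47)

# Estimate F(0) with the empirical probabilities estimator in many
# simulated standard normal samples of two different sizes
est.F0 <- function(n) mean(rnorm(n) <= 0)
small <- replicate(500, est.F0(10))
large <- replicate(500, est.F0(200))

# Spread of the estimates around the true value F(0) = 0.5
c(sd.n10 = sd(small), sd.n200 = sd(large))
```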

Value

ecdfPlot invisibly returns a list with the following components:

`Order.Statistics`	numeric vector of the ordered observations.
`Cumulative.Probabilities`	numeric vector of the associated estimated cumulative probabilities (plotting positions or empirical probabilities, depending on `prob.method`).
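
Because the list is returned invisibly, it has to be assigned to be inspected. The following is a minimal sketch, assuming the EnvStats package is installed; the object name ecdf.list is hypothetical, and the guard simply skips the example when the package is unavailable.

```r
if (requireNamespace("EnvStats", quietly = TRUE)) {
  set.seed(250)
  x <- rnorm(20)

  # Compute the ecdf without drawing anything
  ecdf.list <- EnvStats::ecdfPlot(x, plot.it = FALSE)

  # Inspect the two components described above
  str(ecdf.list)
}
```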

Note

An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
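
Reading quartiles off an ecdf plot amounts to inverting the curve at p = 0.25, 0.5, and 0.75. The base R sketch below does this numerically for the linear interpolation version, assuming plotting positions with a = 0.375; it is an illustration only, and the results need not match other quantile definitions such as those in quantile().

```r
set.seed(250)
x <- rnorm(20)
x.sorted <- sort(x)
p.hat <- ppoints(length(x), a = 0.375)

# Invert the linearly interpolated ecdf at p = 0.25, 0.5, 0.75
quartiles <- approx(p.hat, x.sorted, xout = c(0.25, 0.5, 0.75))$y
quartiles

# Minimum and maximum are the end points of the curve
range(x.sorted)
```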

Chambers et al. (1983, pp.11-16) plot the observed order statistics on the y-axis vs. the ecdf on the x-axis and call this a quantile plot.

Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots (see cdfPlot and cdfCompare) to graphically assess whether a sample of observations comes from a particular distribution. The Kolmogorov-Smirnov goodness-of-fit test (see gofTest) is the statistical companion of this kind of comparison; it is based on the maximum vertical distance between the empirical cdf plot and the theoretical cdf plot. More often, however, quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess departures from an assumed distribution (see qqPlot).
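
The maximum vertical distance mentioned above can be computed by hand for the step-function ecdf based on empirical probabilities. The sketch below assumes a hypothesized N(0, 1) distribution and uses ks.test() from the base stats package for comparison, rather than gofTest, so that it runs without EnvStats.

```r
set.seed(250)
x <- rnorm(20)
n <- length(x)
x.sorted <- sort(x)

# Theoretical N(0, 1) cdf evaluated at the order statistics
F0 <- pnorm(x.sorted)

# The step-function ecdf jumps from (i-1)/n to i/n at x_(i), so the
# largest vertical gap occurs just before or at one of the jumps
D <- max(seq_len(n) / n - F0, F0 - (seq_len(n) - 1) / n)

# Compare with the statistic reported by ks.test() in the stats package
ks <- ks.test(x, "pnorm")
c(manual = D, ks.test = unname(ks$statistic))
```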

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Examples

  # Generate 20 observations from a normal distribution with 
  # mean=0 and sd=1 and create an ecdf plot. 
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(250) 
  x <- rnorm(20) 
  dev.new()
  ecdfPlot(x)

  #----------

  # Repeat the above example, but fill in the area under the 
  # empirical cdf curve.

  dev.new()
  ecdfPlot(x, curve.fill = TRUE)

  #----------

  # Repeat the above example, but plot only the points.

  dev.new()
  ecdfPlot(x, type = "p")

  #----------

  # Repeat the above example, but force a step function.

  dev.new()
  ecdfPlot(x, type = "s")

  #----------

  # Clean up
  rm(x)

  #-------------------------------------------------------------------------------------

  # The guidance document USEPA (1994b, pp. 6.22--6.25) 
  # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB) 
  # concentrations (in parts per billion) from soil samples 
  # at a Reference area and a Cleanup area.  These data are stored 
  # in the data frame EPA.94b.tccb.df.  
  #
  # Create an empirical CDF plot for the reference area data.
  
  dev.new()
  with(EPA.94b.tccb.df, 
    ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)"))

  #==========

  # Clean up
  #---------
  graphics.off()

[Package EnvStats version 2.8.1 Index]