ecdfPlot {EnvStats}  R Documentation 
Produce an empirical cumulative distribution function plot.
ecdfPlot(x, discrete = FALSE,
prob.method = ifelse(discrete, "emp.probs", "plot.pos"),
plot.pos.con = 0.375, plot.it = TRUE, add = FALSE, ecdf.col = "black",
ecdf.lwd = 3 * par("cex"), ecdf.lty = 1, curve.fill = FALSE,
curve.fill.col = "cyan", ..., type = ifelse(discrete, "s", "l"),
main = NULL, xlab = NULL, ylab = NULL, xlim = NULL, ylim = NULL)
x 
numeric vector of observations. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
discrete 
logical scalar indicating whether the assumed parent distribution of x is discrete (discrete=TRUE) or continuous (discrete=FALSE). The default value is discrete=FALSE.
prob.method 
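character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are "emp.probs" (empirical probabilities) and "plot.pos" (plotting positions). The default value is "emp.probs" when discrete=TRUE and "plot.pos" when discrete=FALSE.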
plot.pos.con 
numeric scalar between 0 and 1 containing the value of the plotting position constant.
The default value is plot.pos.con=0.375. This argument is used only when prob.method="plot.pos".
plot.it 
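logical scalar indicating whether to produce a plot or add to the current plot (see add). The default value is plot.it=TRUE.
add 
logical scalar indicating whether to add the empirical cdf to the current plot (add=TRUE) or generate a new plot (add=FALSE). The default value is add=FALSE.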
ecdf.col 
a numeric scalar or character string determining the color of the empirical cdf line or points.
The default value is ecdf.col="black".
ecdf.lwd 
a numeric scalar determining the width of the empirical cdf line. The default value is ecdf.lwd=3*par("cex").
ecdf.lty 
a numeric scalar determining the line type of the empirical cdf line. The default value is ecdf.lty=1.
curve.fill 
a logical scalar indicating whether to fill in the area below the empirical cdf curve with the
color specified by curve.fill.col. The default value is curve.fill=FALSE.
curve.fill.col 
a numeric scalar or character string indicating what color to use to fill in the area below the
empirical cdf curve. The default value is curve.fill.col="cyan".
type , main , xlab , ylab , xlim , ylim , ... 
additional graphical parameters (see par). By default, type="s" when discrete=TRUE and type="l" when discrete=FALSE.
The cumulative distribution function (cdf) of a random variable X is the function F such that
F(x) = Pr(X \le x) \;\;\;\;\;\; (1)
for all values of x. That is, if p = F(x), then p is the proportion of the population that is
less than or equal to x, and x is called the p'th quantile, or the 100p'th percentile. A plot of
quantiles on the x-axis (i.e., the possible values for the random variable X) vs. the fraction of
the population less than or equal to that number on the y-axis is called the cumulative
distribution function plot, and the y-axis is usually labeled as the
“cumulative probability” or “cumulative frequency”.
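As a small illustration of the relationship between p = F(x) and quantiles in Equation (1), the following sketch uses the standard normal distribution and the base R functions pnorm and qnorm. The choice of distribution and of the value x = 1.5 is an assumption made only for this example.

```r
# Assumption for illustration: X follows a standard normal distribution.
x <- 1.5

# p = F(x): proportion of the population less than or equal to x.
p <- pnorm(x)

# Inverting the cdf recovers x, i.e., x is the p'th quantile
# (equivalently, the 100p'th percentile).
x.back <- qnorm(p)

print(c(x = x, p = p, quantile = x.back))
```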
When we have a sample of data from some population, we usually do not
know what percentiles our observations correspond to because we do not
know the form of the cumulative distribution function F, so we
have to use the sample data to estimate the cdf F. An
empirical cumulative distribution function (ecdf) plot,
also called a quantile plot, is a plot of the observed
quantiles (i.e., the ordered observations) on the x-axis vs.
the estimated cumulative probabilities on the y-axis
(Chambers et al., 1983, pp. 11-19; Cleveland, 1993, pp. 17-20;
Cleveland, 1994, pp. 136-139; Helsel and Hirsch, 1992, pp. 21-24).
(Note: Some authors (e.g., Chambers et al., 1983, pp. 11-16; Cleveland, 1993, pp. 17-20)
reverse the axes on a quantile plot, i.e., the observed order statistics from the
random sample are on the y-axis and the estimated cumulative probabilities
are on the x-axis.)
The empirical cumulative distribution function (ecdf)
is an estimate of the cdf based on a random sample of n observations
from the distribution. Let x_1, x_2, \ldots, x_n denote the n
observations, and let x_{(1)}, x_{(2)}, \ldots, x_{(n)} denote the ordered
observations (i.e., the order statistics). The cdf is usually estimated by either
the empirical probabilities estimator or the
plotting-position estimator. The empirical probabilities estimator
is given by:
\hat{F}[x_{(i)}] = \hat{p}_i = \frac{\#[x_j \le x_{(i)}]}{n} \;\;\;\;\;\; (2)
where \#[x_j \le x_{(i)}] denotes the number of observations less than
or equal to x_{(i)}. The plotting-position estimator is given by:
\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (3)
where 0 \le a \le 1
(Cleveland, 1993, p. 18; D'Agostino, 1986a, pp. 8, 25).
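The two estimators in Equations (2) and (3) can be computed directly in base R. The sketch below uses a small made-up sample (hypothetical values, chosen to include a tie) and a = 0.375; it illustrates the formulas and is not necessarily how ecdfPlot computes them internally.

```r
# Hypothetical small sample, chosen only for illustration (note the tie at 3.1).
x <- c(2.4, 3.1, 0.7, 3.1, 5.6, 1.9)
n <- length(x)
x.sorted <- sort(x)

# Equation (2): number of observations <= x_(i), divided by n.
emp.probs <- sapply(x.sorted, function(v) sum(x <= v)) / n

# Equation (3): (i - a) / (n - 2a + 1), here with a = 0.375.
a <- 0.375
i <- seq_len(n)
plot.pos <- (i - a) / (n - 2 * a + 1)

# Base R's ppoints() uses the same formula when 'a' is supplied.
plot.pos.check <- ppoints(n, a = a)

print(data.frame(x.sorted, emp.probs, plot.pos))
```

With ties, the empirical probabilities estimator assigns the same value to each tied observation, whereas the plotting positions depend only on the rank i.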
For any value x such that x_{(1)} < x < x_{(n)}, the ecdf is usually defined as either a step function:
\hat{F}(x) = \hat{F}[x_{(i)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (4)
(e.g., D'Agostino, 1986a), or linear interpolation between order statistics is used:
\hat{F}(x) = (1 - r)\hat{F}[x_{(i)}] + r\hat{F}[x_{(i+1)}], \qquad x_{(i)} \le x < x_{(i+1)} \;\;\;\;\;\; (5)
where
r = \frac{x - x_{(i)}}{x_{(i+1)} - x_{(i)}} \;\;\;\;\;\; (6)
(e.g., Chambers et al., 1983). For the step function version, the ecdf stays flat until it hits a
value on the x-axis corresponding to one of the order statistics, then it makes a jump.
For the linear interpolation version, the ecdf plot looks like lines connecting the points.
By default, the function ecdfPlot uses the step function version when discrete=TRUE, and
the linear interpolation version when discrete=FALSE. The user may override these defaults by
supplying the graphics parameter type (type="s" for a step function, type="l"
for linear interpolation, type="p" for points only, etc.).
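To make the difference between Equations (4) and (5) concrete, the following sketch evaluates both versions of the ecdf at a point between two order statistics. It uses base R's approxfun as a stand-in; the order statistics and probabilities are hypothetical values chosen for easy arithmetic.

```r
# Hypothetical order statistics and estimated cumulative probabilities.
x.sorted <- c(1, 2, 4)
p.hat <- c(0.2, 0.5, 0.8)

# Equation (4): step function; flat on [x_(i), x_(i+1)), jumping at x_(i+1).
ecdf.step <- approxfun(x.sorted, p.hat, method = "constant", f = 0)

# Equations (5)-(6): linear interpolation between order statistics.
ecdf.linear <- approxfun(x.sorted, p.hat, method = "linear")

# x.new lies between x_(2) = 2 and x_(3) = 4, so r = (3 - 2) / (4 - 2) = 0.5.
x.new <- 3
step.value <- ecdf.step(x.new)      # equals F.hat[x_(2)] = 0.5
linear.value <- ecdf.linear(x.new)  # equals (1 - r) * 0.5 + r * 0.8 = 0.65

print(c(step = step.value, linear = linear.value))
```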
The empirical probabilities estimator is intuitively appealing. This is the estimator used when
prob.method="emp.probs". The disadvantage of this estimator is that it implies the largest
observed value is the maximum possible value of the distribution (i.e., the 100'th percentile). This
may be satisfactory if the underlying distribution is known to be discrete, but it is usually not
satisfactory if the underlying distribution is known to be continuous.
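The point about the largest observation can be checked with a line of arithmetic. The sketch below evaluates Equations (2) and (3) at i = n for an assumed sample size of n = 20 and a = 0.375.

```r
n <- 20
a <- 0.375  # plotting position constant (the default value of plot.pos.con)

# Empirical probabilities estimator, Equation (2), at the largest observation:
# all n observations are <= x_(n).
p.max.emp <- n / n

# Plotting-position estimator, Equation (3), at i = n.
p.max.pp <- (n - a) / (n - 2 * a + 1)

print(c(emp.probs = p.max.emp, plot.pos = p.max.pp))
```

The empirical probabilities estimator assigns cumulative probability 1 to the sample maximum, while the plotting-position estimator leaves some probability above it.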
The plotting-position estimator with various values of a is often used when the goal is
to produce a probability plot (see qqPlot) rather than an empirical cdf plot. It is used
to compute the estimated expected values or medians of the order statistics for a probability plot.
This is the estimator used when prob.method="plot.pos". The argument plot.pos.con refers
to the variable a. Based on certain principles from statistical theory, certain
values of the constant a make sense for specific underlying distributions (see
the help file for qqPlot for more information).
Because x is a random sample, the empirical cdf changes from sample to sample and the variability
in these estimates can be dramatic for small sample sizes.
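A short simulation gives a sense of this sample-to-sample variability. The sketch below is only an illustration under stated assumptions: it uses standard normal data, base R's ecdf (the empirical probabilities estimator), an arbitrary seed, and the hypothetical helper estimate.at.zero, and it compares the spread of the estimated cdf at x = 0 for n = 10 and n = 1000.

```r
set.seed(47)  # arbitrary seed so the sketch is reproducible

# Hypothetical helper: estimate F(0) from one simulated standard normal
# sample of size n, using base R's ecdf().
estimate.at.zero <- function(n) ecdf(rnorm(n))(0)

# Repeat for 200 simulated samples at a small and a large sample size.
est.small <- replicate(200, estimate.at.zero(10))
est.large <- replicate(200, estimate.at.zero(1000))

# The true value is F(0) = 0.5; compare the spread of the estimates.
print(c(sd.n10 = sd(est.small), sd.n1000 = sd(est.large)))
```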
ecdfPlot invisibly returns a list with the following components:
Order.Statistics 
numeric vector of the ordered observations. 
Cumulative.Probabilities 
numeric vector of the associated plotting positions. 
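Because the list is returned invisibly, it must be assigned to be used. The following sketch assumes the EnvStats package is installed and that plot.it = FALSE returns the list without drawing anything; the names ecdf.list and ecdf.df are introduced here only for illustration.

```r
# Requires the EnvStats package; the sketch is skipped if it is not installed.
if (requireNamespace("EnvStats", quietly = TRUE)) {
  set.seed(250)
  x <- rnorm(20)

  # Assumption: with plot.it = FALSE no plot is drawn, but the list is returned.
  ecdf.list <- EnvStats::ecdfPlot(x, plot.it = FALSE)

  # Combine the two documented components into a data frame for inspection.
  ecdf.df <- data.frame(
    x = ecdf.list$Order.Statistics,
    p = ecdf.list$Cumulative.Probabilities
  )
  print(head(ecdf.df))
}
```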
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
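Reading quartiles off an ecdf plot amounts to finding the x values at which the curve crosses 0.25, 0.5, and 0.75. The sketch below does this numerically in base R by interpolating the order statistics against plotting positions with a = 0.375; it is one possible convention and will generally differ slightly from the default of quantile().

```r
set.seed(250)
n <- 20
x.sorted <- sort(rnorm(n))

# Plotting positions (i - a) / (n - 2a + 1) with a = 0.375.
p.hat <- ppoints(n, a = 0.375)

# Read quartiles off the ecdf by interpolating x as a function of p.
quartiles <- approx(p.hat, x.sorted, xout = c(0.25, 0.5, 0.75))$y

# The minimum and maximum are the first and last order statistics.
print(c(min = x.sorted[1], q = quartiles, max = x.sorted[n]))
```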
Chambers et al. (1983, pp. 11-16) plot the observed order statistics on the
y-axis vs. the ecdf on the x-axis and call this a quantile plot.
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots (see cdfPlot and cdfCompare) to
graphically assess whether a sample of observations comes from a particular
distribution. The Kolmogorov-Smirnov goodness-of-fit test
(see gofTest) is the statistical companion of this kind of
comparison; it is based on the maximum vertical distance between the empirical
cdf plot and the theoretical cdf plot. More often, however,
quantile-quantile (Q-Q) plots are used instead of ecdf plots to graphically assess
departures from an assumed distribution (see qqPlot).
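The "maximum vertical distance" behind the Kolmogorov-Smirnov test can be computed by hand. The sketch below uses base R's ks.test as a stand-in for gofTest (an assumption made so that the example does not depend on EnvStats) and a fully specified standard normal as the hypothesized distribution.

```r
set.seed(250)
x <- rnorm(20)
n <- length(x)

# Theoretical cdf (standard normal, assumed here) at the order statistics.
F.theo <- pnorm(sort(x))
i <- seq_len(n)

# The step-function ecdf jumps from (i - 1)/n to i/n at x_(i), so the largest
# vertical gap must be checked both just before and at each jump.
D <- max(c(i / n - F.theo, F.theo - (i - 1) / n))

# Compare with the statistic reported by base R's ks.test().
D.ks <- unname(ks.test(x, "pnorm")$statistic)
print(c(manual = D, ks.test = D.ks))
```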
Steven P. Millard (EnvStats@ProbStatInfo.com)
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp. 11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp. 7-62.
ppoints, cdfPlot, cdfCompare, qqPlot, ecdfPlotCensored.
# Generate 20 observations from a normal distribution with
# mean=0 and sd=1 and create an ecdf plot.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(250)
x <- rnorm(20)
dev.new()
ecdfPlot(x)
#
# Repeat the above example, but fill in the area under the
# empirical cdf curve.
dev.new()
ecdfPlot(x, curve.fill = TRUE)
#
# Repeat the above example, but plot only the points.
dev.new()
ecdfPlot(x, type = "p")
#
# Repeat the above example, but force a step function.
dev.new()
ecdfPlot(x, type = "s")
#
# Clean up
rm(x)
#
# The guidance document USEPA (1994b, pp. 6.22-6.25)
# contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB)
# concentrations (in parts per billion) from soil samples
# at a Reference area and a Cleanup area. These data are stored
# in the data frame EPA.94b.tccb.df.
#
# Create an empirical CDF plot for the reference area data.
dev.new()
with(EPA.94b.tccb.df,
ecdfPlot(TcCB[Area == "Reference"], xlab = "TcCB (ppb)"))
#==========
# Clean up
#
graphics.off()