cdfCompareCensored {EnvStats} | R Documentation |
Plot Two Cumulative Distribution Functions Based on Censored Data
Description
For one sample, plots the empirical cumulative distribution function (ecdf) along with a theoretical cumulative distribution function (cdf). For two samples, plots the two ecdf's. These plots are used to graphically assess goodness of fit.
Usage
cdfCompareCensored(x, censored, censoring.side = "left",
y = NULL, y.censored = NULL, y.censoring.side = censoring.side,
discrete = FALSE, prob.method = "michael-schucany",
plot.pos.con = NULL, distribution = "norm", param.list = NULL,
estimate.params = is.null(param.list), est.arg.list = NULL,
x.col = "blue", y.or.fitted.col = "black", x.lwd = 3 * par("cex"),
y.or.fitted.lwd = 3 * par("cex"), x.lty = 1, y.or.fitted.lty = 2,
include.x.cen = FALSE, x.cen.pch = ifelse(censoring.side == "left", 6, 2),
x.cen.cex = par("cex"), x.cen.col = "red",
include.y.cen = FALSE, y.cen.pch = ifelse(y.censoring.side == "left", 6, 2),
y.cen.cex = par("cex"), y.cen.col = "black", digits = .Options$digits, ...,
type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL,
xlim = NULL, ylim = NULL)
Arguments
x |
numeric vector of observations. Missing ( |
censored |
numeric or logical vector indicating which values of |
censoring.side |
character string indicating on which side the censoring occurs. The possible values are
|
y |
a numeric vector (not necessarily of the same length as |
y.censored |
numeric or logical vector indicating which values of This argument is ignored when |
y.censoring.side |
character string indicating on which side the censoring occurs for the values of
|
discrete |
logical scalar indicating whether the assumed parent distribution of |
prob.method |
character string indicating what method to use to compute the plotting positions (empirical probabilities).
Possible values are
The |
plot.pos.con |
numeric scalar between 0 and 1 containing the value of the plotting position constant.
When |
distribution |
when |
param.list |
when |
estimate.params |
when |
est.arg.list |
when |
x.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
y.or.fitted.col |
a numeric scalar or character string determining the color of the empirical cdf
(based on |
x.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
y.or.fitted.lwd |
a numeric scalar determining the width of the empirical cdf (based on |
x.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
y.or.fitted.lty |
a numeric scalar determining the line type of the empirical cdf
(based on |
include.x.cen |
logical scalar indicating whether to include censored values in |
x.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
x.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
x.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
include.y.cen |
logical scalar indicating whether to include censored values in |
y.cen.pch |
numeric scalar or character string indicating the plotting character to use to plot
censored values in |
y.cen.cex |
numeric scalar that determines the size of the plotting character used to plot
censored values in |
y.cen.col |
numeric scalar or character string that determines the color of the plotting
character used to plot censored values in |
digits |
when |
type , main , xlab , ylab , xlim , ylim , ... |
additional graphical parameters (see |
Details
When both x
and y
are supplied, the function cdfCompareCensored
creates the empirical cdf plot of x
and y
on
the same plot by calling the function ecdfPlotCensored
.
When y
is not supplied, the function cdfCompareCensored
creates the
emprical cdf plot of x
(by calling ecdfPlotCensored
) and the
theoretical cdf plot (by calling cdfPlot
and using the
argument distribution
) on the same plot.
Value
When y
is supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
y.ecdf.list |
a list with components |
When y
is not supplied, cdfCompareCensored
invisibly returns a list with
components:
x.ecdf.list |
a list with components |
fitted.cdf.list |
a list with components |
Note
An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data. It is easy to determine quartiles and the minimum and maximum values from such a plot. Also, ecdf plots allow you to assess local density: a higher density of observations occurs where the slope is steep.
Chambers et al. (1983, pp.11-16) plot the observed order statistics on the
y
-axis vs. the ecdf on the x
-axis and call this a quantile plot.
Censored observations complicate the procedures used to graphically explore data.
Techniques from survival analysis and life testing have been developed to generalize
the procedures for constructing plotting positions, empirical cdf plots, and
q-q plots to data sets with censored observations
(see ppointsCensored
).
Empirical cumulative distribution function (ecdf) plots are often plotted with
theoretical cdf plots to graphically assess whether a sample of observations
comes from a particular distribution. More often, however, quantile-quantile
(Q-Q) plots are used instead of ecdf plots to graphically assess departures from
an assumed distribution (see qqPlotCensored
).
Author(s)
Steven P. Millard (EnvStats@ProbStatInfo.com)
References
Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.
Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.
D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.
Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.
Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.
Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.
Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.
Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.
Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.
Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.
Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.
USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.
USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.
See Also
cdfPlot
, ecdfPlotCensored
, qqPlotCensored
.
Examples
# Generate 20 observations from a normal distribution with mean=20 and sd=5,
# censor all observations less than 18, then compare the empirical cdf with a
# theoretical normal cdf that is based on estimating the parameters.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(333)
x <- sort(rnorm(20, mean=20, sd=5))
x
# [1] 9.743551 12.370197 14.375499 15.628482 15.883507 17.080124
# [7] 17.197588 18.097714 18.654182 19.585942 20.219308 20.268505
#[13] 20.552964 21.388695 21.763587 21.823639 23.168039 26.165269
#[19] 26.843362 29.673405
censored <- x < 18
x[censored] <- 18
sum(censored)
#[1] 7
dev.new()
cdfCompareCensored(x, censored)
# Clean up
#---------
rm(x, censored)
#==========
# Example 15-1 of USEPA (2009, page 15-10) gives an example of
# computing plotting positions based on censored manganese
# concentrations (ppb) in groundwater collected at 5 monitoring
# wells. The data for this example are stored in
# EPA.09.Ex.15.1.manganese.df. Here we will compare the empirical
# cdf based on Kaplan-Meier plotting positions or Michael-Schucany
# plotting positions with various assumed distributions
# (based on estimating the parameters of these distributions):
# 1) normal distribution
# 2) lognormal distribution
# 3) gamma distribution
# First look at the data:
#------------------------
EPA.09.Ex.15.1.manganese.df
# Sample Well Manganese.Orig.ppb Manganese.ppb Censored
#1 1 Well.1 <5 5.0 TRUE
#2 2 Well.1 12.1 12.1 FALSE
#3 3 Well.1 16.9 16.9 FALSE
#4 4 Well.1 21.6 21.6 FALSE
#5 5 Well.1 <2 2.0 TRUE
#...
#21 1 Well.5 17.9 17.9 FALSE
#22 2 Well.5 22.7 22.7 FALSE
#23 3 Well.5 3.3 3.3 FALSE
#24 4 Well.5 8.4 8.4 FALSE
#25 5 Well.5 <2 2.0 TRUE
longToWide(EPA.09.Ex.15.1.manganese.df,
"Manganese.Orig.ppb", "Sample", "Well",
paste.row.name = TRUE)
# Well.1 Well.2 Well.3 Well.4 Well.5
#Sample.1 <5 <5 <5 6.3 17.9
#Sample.2 12.1 7.7 5.3 11.9 22.7
#Sample.3 16.9 53.6 12.6 10 3.3
#Sample.4 21.6 9.5 106.3 <2 8.4
#Sample.5 <2 45.9 34.5 77.2 <2
# Assume a normal distribution
#-----------------------------
# Michael-Schucany plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored))
# Kaplan-Meier plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored,
prob.method = "kaplan-meier"))
# Assume a lognormal distribution
#--------------------------------
# Michael-Schucany plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm"))
# Kaplan-Meier plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored, dist = "lnorm",
prob.method = "kaplan-meier"))
# Assume a gamma distribution
#----------------------------
# Michael-Schucany plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma"))
# Kaplan-Meier plotting positions:
dev.new()
with(EPA.09.Ex.15.1.manganese.df,
cdfCompareCensored(Manganese.ppb, Censored, dist = "gamma",
prob.method = "kaplan-meier"))
# Clean up
#---------
graphics.off()
#==========
# Compare the distributions of copper and zinc between the Alluvial Fan Zone
# and the Basin-Trough Zone using the data of Millard and Deverel (1988).
# The data are stored in Millard.Deverel.88.df.
Millard.Deverel.88.df
# Cu.orig Cu Cu.censored Zn.orig Zn Zn.censored Zone Location
#1 < 1 1 TRUE <10 10 TRUE Alluvial.Fan 1
#2 < 1 1 TRUE 9 9 FALSE Alluvial.Fan 2
#3 3 3 FALSE NA NA FALSE Alluvial.Fan 3
#.
#.
#.
#116 5 5 FALSE 50 50 FALSE Basin.Trough 48
#117 14 14 FALSE 90 90 FALSE Basin.Trough 49
#118 4 4 FALSE 20 20 FALSE Basin.Trough 50
Cu.AF <- with(Millard.Deverel.88.df,
Cu[Zone == "Alluvial.Fan"])
Cu.AF.cen <- with(Millard.Deverel.88.df,
Cu.censored[Zone == "Alluvial.Fan"])
Cu.BT <- with(Millard.Deverel.88.df,
Cu[Zone == "Basin.Trough"])
Cu.BT.cen <- with(Millard.Deverel.88.df,
Cu.censored[Zone == "Basin.Trough"])
Zn.AF <- with(Millard.Deverel.88.df,
Zn[Zone == "Alluvial.Fan"])
Zn.AF.cen <- with(Millard.Deverel.88.df,
Zn.censored[Zone == "Alluvial.Fan"])
Zn.BT <- with(Millard.Deverel.88.df,
Zn[Zone == "Basin.Trough"])
Zn.BT.cen <- with(Millard.Deverel.88.df,
Zn.censored[Zone == "Basin.Trough"])
# First compare the copper concentrations
#----------------------------------------
dev.new()
cdfCompareCensored(x = Cu.AF, censored = Cu.AF.cen,
y = Cu.BT, y.censored = Cu.BT.cen)
# Now compare the zinc concentrations
#------------------------------------
dev.new()
cdfCompareCensored(x = Zn.AF, censored = Zn.AF.cen,
y = Zn.BT, y.censored = Zn.BT.cen)
# Compare the Zinc concentrations again, but delete
# the one "outlier".
#--------------------------------------------------
summaryStats(Zn.AF)
# N Mean SD Median Min Max NA's N.Total
#Zn.AF 67 23.5075 74.4192 10 3 620 1 68
summaryStats(Zn.BT)
# N Mean SD Median Min Max
#Zn.BT 50 21.94 18.7044 18.5 3 90
which(Zn.AF == 620)
#[1] 38
summaryStats(Zn.AF[-38])
# N Mean SD Median Min Max NA's N.Total
#Zn.AF[-38] 66 14.4697 8.1604 10 3 50 1 67
dev.new()
cdfCompareCensored(x = Zn.AF[-38], censored = Zn.AF.cen[-38],
y = Zn.BT, y.censored = Zn.BT.cen)
#----------
# Clean up
#---------
rm(Cu.AF, Cu.AF.cen, Cu.BT, Cu.BT.cen,
Zn.AF, Zn.AF.cen, Zn.BT, Zn.BT.cen)
graphics.off()