ecdfPlotCensored {EnvStats}

R Documentation

Empirical Cumulative Distribution Function Plot Based on Type I Censored Data

Description

Produce an empirical cumulative distribution function plot for Type I left-censored or right-censored data.

Usage

  ecdfPlotCensored(x, censored, censoring.side = "left", discrete = FALSE,
    prob.method = "michael-schucany", plot.pos.con = 0.375, plot.it = TRUE,
    add = FALSE, ecdf.col = 1, ecdf.lwd = 3 * par("cex"), ecdf.lty = 1,
    include.cen = FALSE, cen.pch = ifelse(censoring.side == "left", 6, 2),
    cen.cex = par("cex"), cen.col = 4, ...,
    type = ifelse(discrete, "s", "l"), main = NULL, xlab = NULL, ylab = NULL,
    xlim = NULL, ylim = NULL)

Arguments

`x`	numeric vector of observations. Missing (`NA`), undefined (`NaN`), and infinite (`Inf`, `-Inf`) values are allowed but will be removed.
`censored`	numeric or logical vector indicating which values of `x` are censored. This must be the same length as `x`. If the mode of `censored` is `"logical"`, `TRUE` values correspond to elements of `x` that are censored, and `FALSE` values correspond to elements of `x` that are not censored. If the mode of `censored` is `"numeric"`, it must contain only `1`'s and `0`'s; `1` corresponds to `TRUE` and `0` corresponds to `FALSE`. Missing (`NA`) values are allowed but will be removed.
`censoring.side`	character string indicating on which side the censoring occurs. The possible values are `"left"` (the default) and `"right"`.
`discrete`	logical scalar indicating whether the assumed parent distribution of `x` is discrete (`discrete=TRUE`) or continuous (`discrete=FALSE`; the default).
`prob.method`	character string indicating what method to use to compute the plotting positions (empirical probabilities). Possible values are `"kaplan-meier"` (product-limit method of Kaplan and Meier (1958)), `"nelson"` (hazard plotting method of Nelson (1972)), `"michael-schucany"` (generalization of the product-limit method due to Michael and Schucany (1986)), and `"hirsch-stedinger"` (generalization of the product-limit method due to Hirsch and Stedinger (1987)). The default value is `prob.method="michael-schucany"`. The `"nelson"` method is only available for `censoring.side="right"`. See the DETAILS section for more explanation.
`plot.pos.con`	numeric scalar between 0 and 1 containing the value of the plotting position constant. The default value is `plot.pos.con=0.375`. See the DETAILS section for more information. This argument is used only if `prob.method` is equal to `"michael-schucany"` or `"hirsch-stedinger"`.
`plot.it`	logical scalar indicating whether to produce a plot or add to the current plot (see `add`) on the current graphics device. The default value is `plot.it=TRUE`.
`add`	logical scalar indicating whether to add the empirical cdf to the current plot (`add=TRUE`) or generate a new plot (`add=FALSE`; the default). This argument is ignored if `plot.it=FALSE`.
`ecdf.col`	a numeric scalar or character string determining the color of the empirical cdf line or points. The default value is `ecdf.col=1`. See the entry for `col` in the help file for `par` for more information.
`ecdf.lwd`	a numeric scalar determining the width of the empirical cdf line. The default value is `ecdf.lwd=3*par("cex")`. See the entry for `lwd` in the help file for `par` for more information.
`ecdf.lty`	a numeric scalar determining the line type of the empirical cdf line. The default value is `ecdf.lty=1`. See the entry for `lty` in the help file for `par` for more information.
`include.cen`	logical scalar indicating whether to include censored values in the plot. The default value is `include.cen=FALSE`. If `include.cen=TRUE`, censored values are plotted using the plotting character indicated by the argument `cen.pch` (see below).
`cen.pch`	numeric scalar or character string indicating the plotting character to use to plot censored values. The default value is `cen.pch=2` (hollow triangle pointing up) when `censoring.side="right"`, and `cen.pch=6` (hollow triangle pointing down) when `censoring.side="left"`. See the help file for `points` for a list of other possible plotting characters. This argument is ignored if `include.cen=FALSE`.
`cen.cex`	numeric scalar that determines the size of the plotting character used to plot censored values. The default value is the current value of the cex graphics parameter. See the entry for `cex` in the help file for `par` for more information. This argument is ignored if `include.cen=FALSE`.
`cen.col`	numeric scalar or character string that determines the color of the plotting character used to plot censored values. The default value is `cen.col=4`. See the entry for `col` in the help file for `par` for more information. This argument is ignored if `include.cen=FALSE`.
`type`, `main`, `xlab`, `ylab`, `xlim`, `ylim`, `...`	additional graphical parameters (see `lines` and `par`). In particular, the argument `type` specifies how the empirical cdf is drawn. By default, the function `ecdfPlotCensored` plots a step function (`type="s"`) when `discrete=TRUE`, and plots a straight line between points (`type="l"`) when `discrete=FALSE`. The user may override these defaults by supplying the graphics parameter `type` (`type="s"` for a step function, `type="l"` for linear interpolation, `type="p"` for points only, etc.).
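
Laboratory results often arrive as text with a `<` prefix marking non-detects, as in the `Manganese.Orig.ppb` column shown in the Examples. The following is a minimal sketch, using base R only, of turning such values into the `x` and `censored` arguments. The vector name `reported` is a hypothetical input and is not part of EnvStats.

```r
# Hypothetical raw results; "<" marks a value below a detection limit.
reported <- c("<5", "12.1", "16.9", "21.6", "<2")

# Logical form of 'censored': TRUE where the result is a non-detect.
censored <- startsWith(reported, "<")

# Numeric 'x': strip the "<" so censored entries hold the censoring level.
x <- as.numeric(sub("^<", "", reported))

# Equivalent numeric (0/1) form of 'censored'.
censored.01 <- as.numeric(censored)
```

Either `censored` or `censored.01` could then be passed as the `censored` argument, since both encodings are accepted.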

Details

The function ecdfPlotCensored does exactly the same thing as ecdfPlot, except it calls the function ppointsCensored to compute the plotting positions (estimated cumulative probabilities) for the uncensored observations.

If plot.it=TRUE, the estimated cumulative probabilities for the uncensored observations are plotted against the uncensored observations. By default, the function ecdfPlotCensored plots a step function when discrete=TRUE, and plots a straight line between points when discrete=FALSE. The user may override these defaults by supplying the graphics parameter type (type="s" for a step function, type="l" for linear interpolation, type="p" for points only, etc.).

If include.cen=TRUE, censored observations are included on the plot as points. The arguments cen.pch, cen.cex, and cen.col control the appearance of these points.
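
As an illustration of these arguments, the sketch below reuses the simulated data from the Examples section and marks the censored observations with red downward triangles. The particular values chosen for `cen.cex` and `cen.col` are arbitrary.

```r
library(EnvStats)

# Same simulated data as in the Examples section.
set.seed(333)
x <- rnorm(20, mean = 20, sd = 5)
censored <- x < 18
x[censored] <- 18  # replace censored values with the censoring level

# Draw the empirical cdf and add the censored observations as points.
ecdfPlotCensored(x, censored, include.cen = TRUE,
  cen.pch = 6, cen.cex = 1.2, cen.col = "red",
  main = "Empirical CDF with\nCensored Values Shown")
```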

In cases where x is a random sample, the empirical cdf will change from sample to sample and the variability in these estimates can be dramatic for small sample sizes. Caution must be used in interpreting the empirical cdf when a large percentage of the observations are censored.
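
The effect of sample size can be seen with a small simulation. The sketch below uses base R only and, as a simplifying assumption, works with complete (uncensored) samples and the ordinary empirical cdf (the proportion of observations less than or equal to `q`), so it does not reproduce the plotting positions computed by ppointsCensored. The helper `ecdf.sd` is hypothetical.

```r
set.seed(1)

# Standard deviation, across n.rep simulated samples of size n, of the
# ordinary empirical cdf evaluated at q (proportion of values <= q).
ecdf.sd <- function(n, q = 20, n.rep = 1000) {
  est <- replicate(n.rep, mean(rnorm(n, mean = 20, sd = 5) <= q))
  sd(est)
}

sd.small <- ecdf.sd(10)   # samples of size 10
sd.large <- ecdf.sd(100)  # samples of size 100

c(n10 = sd.small, n100 = sd.large)
```

For this setup the true cdf at `q = 20` is 0.5, so the two standard deviations should be close to `sqrt(0.25 / n)`, which is larger for the smaller sample size.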

Value

ecdfPlotCensored returns a list with the following components:

`Order.Statistics`	numeric vector of the “ordered” observations.
`Cumulative.Probabilities`	numeric vector of the associated plotting positions.
`Censored`	logical vector indicating which of the ordered observations are censored.
`Censoring.Side`	character string indicating whether the data are left- or right-censored. This is the same value as the argument `censoring.side`.
`Prob.Method`	character string indicating what method was used to compute the plotting positions. This is the same value as the argument `prob.method`.

Optional component (only present when prob.method="michael-schucany" or prob.method="hirsch-stedinger"):

`Plot.Pos.Con`	numeric scalar containing the value of the plotting position constant that was used. This is the same as the argument `plot.pos.con`.
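
The returned list can be captured and inspected without drawing anything. The sketch below assumes, based on the argument descriptions above, that `plot.it = FALSE` suppresses the plot while still returning the list. It uses the manganese data from the Examples section, and the object names `ms`, `km`, and `uncensored` are illustrative.

```r
library(EnvStats)

# Default method ("michael-schucany"), no plot.
ms <- with(EPA.09.Ex.15.1.manganese.df,
  ecdfPlotCensored(Manganese.ppb, Censored, plot.it = FALSE))

# Kaplan-Meier method, no plot.
km <- with(EPA.09.Ex.15.1.manganese.df,
  ecdfPlotCensored(Manganese.ppb, Censored,
    prob.method = "kaplan-meier", plot.it = FALSE))

# Compare the component names returned by the two methods.
names(ms)
names(km)

# Plotting positions for the uncensored observations only.
uncensored <- !ms$Censored
data.frame(x = ms$Order.Statistics[uncensored],
  p = ms$Cumulative.Probabilities[uncensored])
```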

Note

An empirical cumulative distribution function (ecdf) plot is a graphical tool that can be used in conjunction with other graphical tools such as histograms, strip charts, and boxplots to assess the characteristics of a set of data.

Censored observations complicate the procedures used to graphically explore data. Techniques from survival analysis and life testing have been developed to generalize the procedures for constructing plotting positions, empirical cdf plots, and q-q plots to data sets with censored observations (see ppointsCensored).

Empirical cumulative distribution function (ecdf) plots are often plotted with theoretical cdf plots (see cdfPlot and cdfCompareCensored) to graphically assess whether a sample of observations comes from a particular distribution. More often, however, quantile-quantile (Q-Q) plots are used instead (see qqPlot and qqPlotCensored).
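
One simple way to make such a comparison by hand is to overlay a hypothesized cdf on the empirical cdf with base graphics. The sketch below does this for the simulated data from the Examples section, using the normal distribution the data were generated from. It relies on `curve` and `pnorm` from base R rather than on cdfCompareCensored, and the axis limits are arbitrary choices.

```r
library(EnvStats)

# Same simulated data as in the Examples section.
set.seed(333)
obs <- rnorm(20, mean = 20, sd = 5)
censored <- obs < 18
obs[censored] <- 18  # replace censored values with the censoring level

# Empirical cdf based on the censored data.
ecdfPlotCensored(obs, censored, xlim = c(5, 35), ylim = c(0, 1),
  main = "Empirical CDF with\nHypothesized Normal CDF")

# Overlay the hypothesized normal cdf (mean 20, sd 5) as a dashed line.
curve(pnorm(x, mean = 20, sd = 5), add = TRUE, lty = 2, col = "red")
```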

Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.

Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.

Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.

Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.

Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.

Lee, E.T., and J.W. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley & Sons, Hoboken, New Jersey, 513pp.

Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, 461-496.

Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery, Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

Examples

  # Generate 20 observations from a normal distribution with mean=20 and sd=5,
  # censor all observations less than 18, then generate an empirical cdf plot
  # for the complete data set and the censored data set.  Note that the empirical
  # cdf plot for the censored data set starts at the first ordered uncensored
  # observation, and that for values of x > 18 the two empirical cdf plots are
  # exactly the same.  This is because there is only one censoring level and
  # no uncensored observations fall below the censored observations.
  # (Note: the call to set.seed simply allows you to reproduce this example.)

  set.seed(333)
  x <- rnorm(20, mean=20, sd=5)
  censored <- x < 18

  sum(censored)
  #[1] 7

  new.x <- x
  new.x[censored] <- 18

  dev.new()
  ecdfPlot(x, xlim = range(pretty(x)),
    main = "Empirical CDF Plot for\nComplete Data Set")

  dev.new()
  ecdfPlotCensored(new.x, censored, xlim = range(pretty(x)),
    main="Empirical CDF Plot for\nCensored Data Set")

  # Clean up
  #---------
  rm(x, censored, new.x)

  #------------------------------------------------------------------------------------

  # Example 15-1 of USEPA (2009, page 15-10) gives an example of
  # computing plotting positions based on censored manganese
  # concentrations (ppb) in groundwater collected at 5 monitoring
  # wells.  The data for this example are stored in
  # EPA.09.Ex.15.1.manganese.df.  Here we will create an empirical
  # CDF plot based on the Kaplan-Meier method.

  EPA.09.Ex.15.1.manganese.df
  #   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
  #1       1 Well.1                 <5           5.0     TRUE
  #2       2 Well.1               12.1          12.1    FALSE
  #3       3 Well.1               16.9          16.9    FALSE
  #4       4 Well.1               21.6          21.6    FALSE
  #5       5 Well.1                 <2           2.0     TRUE
  #...
  #21      1 Well.5               17.9          17.9    FALSE
  #22      2 Well.5               22.7          22.7    FALSE
  #23      3 Well.5                3.3           3.3    FALSE
  #24      4 Well.5                8.4           8.4    FALSE
  #25      5 Well.5                 <2           2.0     TRUE

  dev.new()
  with(EPA.09.Ex.15.1.manganese.df,
    ecdfPlotCensored(Manganese.ppb, Censored,
      prob.method = "kaplan-meier", ecdf.col = "blue",
      main = "Empirical CDF of Manganese Data\nBased on Kaplan-Meier"))

  #==========

  # Clean up
  #---------
  graphics.off()

[Package EnvStats version 2.8.1 Index]