qqPlotCensored {EnvStats} R Documentation

## Quantile-Quantile (Q-Q) Plot for Type I Censored Data

### Description

Produces a quantile-quantile (Q-Q) plot, also called a probability plot, for Type I censored data.

### Usage

  qqPlotCensored(x, censored, censoring.side = "left",
prob.method = "michael-schucany", plot.pos.con = NULL,
distribution = "norm", param.list = list(mean = 0, sd = 1),
estimate.params = plot.type == "Tukey Mean-Difference Q-Q",
est.arg.list = NULL, plot.type = "Q-Q", plot.it = TRUE,
equal.axes = qq.line.type == "0-1" || estimate.params,
add.line = FALSE, qq.line.type = "least squares",
duplicate.points.method = "standard", points.col = 1, line.col = 1,

### Details

The function qqPlotCensored does exactly the same thing as qqPlot (when the argument y is not supplied to qqPlot), except qqPlotCensored calls the function ppointsCensored to compute the plotting positions (estimated cumulative probabilities).

The vector x is assumed to be a sample from the probability distribution specified by the argument distribution (and param.list if estimate.params=FALSE). When plot.type="Q-Q", the quantiles of x are plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.

When plot.type="Tukey Mean-Difference Q-Q", the difference of the quantiles is plotted on the y-axis against the mean of the quantiles on the x-axis.
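For uncensored data, the two sets of coordinates described above can be built by hand in base R. The sketch below is an illustration only, not the code EnvStats uses: it takes plotting positions from `ppoints` (qqPlotCensored obtains them from ppointsCensored instead) and assumes a normal distribution whose parameters are estimated from the sample. The names `obs`, `q.obs`, `q.theo`, `qq`, and `md` are made up for this example.

```r
set.seed(1)
obs <- rnorm(20, mean = 20, sd = 5)   # illustrative sample, no censoring

p      <- ppoints(length(obs))        # plotting positions (base R default)
q.obs  <- sort(obs)                   # observed quantiles (order statistics)
q.theo <- qnorm(p, mean = mean(obs), sd = sd(obs))  # quantiles of fitted normal

# plot.type = "Q-Q": theoretical quantiles on x, observed quantiles on y
qq <- data.frame(x = q.theo, y = q.obs)

# plot.type = "Tukey Mean-Difference Q-Q": mean on x, difference on y
md <- data.frame(x = (q.obs + q.theo) / 2, y = q.obs - q.theo)

head(qq, 3)
head(md, 3)
```

Plotting `qq` and `md` with `plot()` gives the two displays; with censored data only the plotting positions change.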

When prob.method="kaplan-meier" and censoring.side="left" and the assumed distribution has a maximum support of infinity (Inf; e.g., the normal or lognormal distribution), the point involving the largest value of x is not plotted, because its estimated cumulative probability is 1, which corresponds to an infinite plotting position.
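To see why the largest value receives a cumulative probability of 1, here is a minimal sketch of the textbook Kaplan-Meier (product-limit) estimator for left-censored data. The helper `km_left` and the toy data are hypothetical and written only for this illustration; ppointsCensored may differ in details such as the handling of ties.

```r
# Hypothetical helper (not part of EnvStats): product-limit estimate of the
# cumulative probability at each distinct uncensored value, left-censored data.
km_left <- function(x, censored) {
  t_j <- sort(unique(x[!censored]), decreasing = TRUE)              # largest first
  n_j <- vapply(t_j, function(t) sum(x <= t), integer(1))            # all obs <= t_j
  d_j <- vapply(t_j, function(t) sum(x == t & !censored), integer(1)) # uncensored at t_j
  ratio <- (n_j - d_j) / n_j
  # F(t_j) is the product of the ratios for all strictly larger values,
  # so the largest uncensored value always gets F = 1
  F_hat <- c(1, cumprod(ratio)[-length(ratio)])
  data.frame(value = rev(t_j), cum.prob = rev(F_hat))
}

# Toy data: one value censored at 1 (i.e., "<1"), two detected values
toy.x        <- c(1, 2, 3)
toy.censored <- c(TRUE, FALSE, FALSE)

pp <- km_left(toy.x, toy.censored)
pp
qnorm(pp$cum.prob)   # the last element is Inf, so that point cannot be drawn
```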

When prob.method="modified kaplan-meier" and censoring.side="left", the estimated cumulative probability associated with the maximum value is modified from 1 to be (N - .375)/(N + .25) where N denotes the sample size (i.e., the Blom plotting position) so that the point associated with the maximum value can be displayed.
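The effect of this modification can be checked with a few lines of base R. The sample size below is an arbitrary value chosen for illustration, and the variable names are made up for this sketch.

```r
N <- 25                                  # sample size (illustrative)

p.km.max  <- 1                           # Kaplan-Meier estimate at the maximum
p.mod.max <- (N - 0.375) / (N + 0.25)    # modified value (Blom plotting position)

qnorm(p.km.max)    # infinite: the point cannot be placed on a normal Q-Q plot
qnorm(p.mod.max)   # finite: the maximum can now be displayed
```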

### Value

qqPlotCensored returns a list with the following components:

 x numeric vector of x-coordinates for the plot. When plot.type="Q-Q" these are the quantiles from the theoretical distribution. When plot.type="Tukey Mean-Difference Q-Q" these are the averages of the observed and theoretical quantiles.

 y numeric vector of y-coordinates for the plot. When plot.type="Q-Q" these are the observed quantiles (order statistics). When plot.type="Tukey Mean-Difference Q-Q" these are the differences between the observed quantiles (order statistics) and the theoretical quantiles.

 Order.Statistics numeric vector of the “ordered” observations. When plot.type="Q-Q" this component is exactly the same as the component y.

 Cumulative.Probabilities numeric vector of the plotting positions associated with the order statistics.

 Censored logical vector indicating which of the ordered observations are censored.

 Censoring.Side character string indicating whether the data are left- or right-censored. This is the same value as the argument censoring.side.

 Prob.Method character string indicating what method was used to compute the plotting positions. This is the same value as the argument prob.method.

Optional Component (only present when prob.method="michael-schucany" or prob.method="hirsch-stedinger"):

 Plot.Pos.Con numeric scalar containing the value of the plotting position constant that was used. This is the same as the argument plot.pos.con.
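The returned list can be inspected without drawing anything by setting plot.it=FALSE. The sketch below assumes the EnvStats package is installed and simply skips the call otherwise; the simulated data mirror the first example in the Examples section, and the object name `qq` is arbitrary.

```r
qq <- NULL

if (requireNamespace("EnvStats", quietly = TRUE)) {
  set.seed(333)
  x <- rnorm(20, mean = 20, sd = 5)
  censored <- x < 18
  x[censored] <- 18        # left-censor everything below 18

  # Compute the plot coordinates only; no graphics device is opened
  qq <- EnvStats::qqPlotCensored(x, censored, plot.it = FALSE)

  names(qq)                # component names as listed above
  str(qq)                  # structure of each component
}
```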

### Note

A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed order statistics from a random sample (the empirical quantiles) against their (estimated) mean or median values based on an assumed distribution, or against the empirical quantiles of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether data come from a particular distribution, or whether two datasets have the same parent distribution. If the distributions have the same shape (but not necessarily the same location or scale parameters), then the plot will fall roughly on a straight line. If the distributions are exactly the same, then the plot will fall roughly on the straight line y=x.
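The straight-line property can be illustrated with simulated, uncensored data in base R. In this sketch (an illustration under assumed settings, with made-up variable names) a sample from a normal distribution with mean 20 and standard deviation 5 is compared with standard normal quantiles. The points fall near a line whose intercept and slope roughly recover the location and scale, rather than near y=x.

```r
set.seed(42)
obs    <- rnorm(1000, mean = 20, sd = 5)   # same shape, different location/scale
q.theo <- qnorm(ppoints(length(obs)))      # standard normal quantiles

# Least-squares line through the Q-Q points
fit <- lm(sort(obs) ~ q.theo)
coef(fit)                  # intercept near 20 (location), slope near 5 (scale)
summary(fit)$r.squared     # close to 1 when the points are nearly linear
```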

A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles or observed y-quantiles vs. observed x-quantiles, a Tukey mean-difference Q-Q plot plots the difference between the quantiles on the y-axis vs. the average of the quantiles on the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same parent distribution, then the points in this plot should fall roughly along the horizontal line y=0. If one set of quantiles comes from the same distribution with a shift in median, then the points in this plot should fall along a horizontal line above or below the line y=0. A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate from a straight line, because it is easier to judge deviations from a horizontal line than from a line with a non-zero slope.
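The shifted-median case can be checked with exact quantiles in base R. This sketch uses two normal distributions chosen purely for illustration (same standard deviation, medians 20 and 23); all names are made up for the example.

```r
p  <- ppoints(50)
q1 <- qnorm(p, mean = 20, sd = 5)   # quantiles of the first distribution
q2 <- qnorm(p, mean = 23, sd = 5)   # same shape, median shifted up by 3

md.x <- (q1 + q2) / 2               # x-axis: mean of the quantiles
md.y <- q2 - q1                     # y-axis: difference of the quantiles

range(md.y)   # both ends equal the shift of 3, up to floating-point rounding
```

Every point therefore lies on the horizontal line y=3, that is, above the line y=0 by the size of the shift.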

In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped Q-Q plot indicates that the underlying distribution for the observations on the y-axis is skewed to the right relative to the underlying distribution for the observations on the x-axis. An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the left and bent up on the right indicates that the y-axis distribution has longer tails than the x-axis distribution.
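These shapes can be reproduced with exact quantiles instead of random samples, which removes sampling noise from the picture. The sketch below is only an illustration: it uses the exponential distribution as a right-skewed example and the t distribution with 3 degrees of freedom as a long-tailed example, both compared with the normal distribution on the x-axis, and looks at the slope between successive points.

```r
p   <- ppoints(50)
x.q <- qnorm(p)                      # x-axis: normal quantiles

# Right-skewed y-axis distribution (exponential): the slope keeps increasing
# from left to right, which gives the U-shaped (convex) curve
slope.skew <- diff(qexp(p)) / diff(x.q)
all(diff(slope.skew) > 0)

# Longer-tailed y-axis distribution (t with 3 df): steeper at both ends
# than in the middle, i.e., bent down on the left and up on the right
slope.tail <- diff(qt(p, df = 3)) / diff(x.q)
c(left = slope.tail[1], middle = slope.tail[25], right = slope.tail[49])
```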

Censored observations complicate the procedures used to graphically explore data. Techniques from survival analysis and life testing have been developed to generalize the procedures for constructing plotting positions, empirical cdf plots, and Q-Q plots to data sets with censored observations (see ppointsCensored).

### Author(s)

Steven P. Millard (EnvStats@ProbStatInfo.com)

### References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

Gillespie, B.W., Q. Chen, H. Reichert, A. Franzblau, E. Hedgeman, J. Lepkowski, P. Adriaens, A. Demond, W. Luksemburg, and D.H. Garabrant. (2010). Estimating Population Distributions When Some Data Are Below a Limit of Detection by Using a Reverse Kaplan-Meier Estimator. Epidemiology 21(4), S64–S70.

Helsel, D.R. (2012). Statistics for Censored Environmental Data Using Minitab and R, Second Edition. John Wiley & Sons, Hoboken, New Jersey.

Helsel, D.R., and T.A. Cohn. (1988). Estimation of Descriptive Statistics for Multiply Censored Water Quality Data. Water Resources Research 24(12), 1997-2004.

Hirsch, R.M., and J.R. Stedinger. (1987). Plotting Positions for Historical Floods and Their Precision. Water Resources Research 23(4), 715-727.

Kaplan, E.L., and P. Meier. (1958). Nonparametric Estimation From Incomplete Observations. Journal of the American Statistical Association 53, 457-481.

Lee, E.T., and J. Wang. (2003). Statistical Methods for Survival Data Analysis, Third Edition. John Wiley and Sons, New York.

Michael, J.R., and W.R. Schucany. (1986). Analysis of Data from Censored Samples. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, 560pp, Chapter 11, pp.461-496.

Nelson, W. (1972). Theory and Applications of Hazard Plotting for Censored Failure Data. Technometrics 14, 945-966.

USEPA. (2009). Statistical Analysis of Groundwater Monitoring Data at RCRA Facilities, Unified Guidance. EPA 530/R-09-007, March 2009. Office of Resource Conservation and Recovery Program Implementation and Information Division. U.S. Environmental Protection Agency, Washington, D.C. Chapter 15.

USEPA. (2010). Errata Sheet - March 2009 Unified Guidance. EPA 530/R-09-007a, August 9, 2010. Office of Resource Conservation and Recovery, Program Information and Implementation Division. U.S. Environmental Protection Agency, Washington, D.C.

### See Also

ppointsCensored, EnvStats Functions for Censored Data, qqPlot, ecdfPlotCensored, qqPlotGestalt.

### Examples

  # Generate 20 observations from a normal distribution with mean=20 and sd=5,
# censor all observations less than 18, then generate a Q-Q plot assuming
# a normal distribution for the complete data set and the censored data set.
# Note that the Q-Q plot for the censored data set starts at the first ordered
# uncensored observation, and that for values of x > 18 the two Q-Q plots are
# exactly the same.  This is because there is only one censoring level and
# no uncensored observations fall below the censored observations.
# (Note: the call to set.seed simply allows you to reproduce this example.)

set.seed(333)
x <- rnorm(20, mean=20, sd=5)
censored <- x < 18

sum(censored)
#[1] 7

new.x <- x
new.x[censored] <- 18

dev.new()
qqPlot(x, ylim = range(pretty(x)),
main = "Q-Q Plot for\nComplete Data Set")

dev.new()
qqPlotCensored(new.x, censored, ylim = range(pretty(x)),
main="Q-Q Plot for\nCensored Data Set")

# Clean up
#---------
rm(x, censored, new.x)

#------------------------------------------------------------------------------------

# Example 15-1 of USEPA (2009, page 15-10) gives an example of
# computing plotting positions based on censored manganese
# concentrations (ppb) in groundwater collected at 5 monitoring
# wells.  The data for this example are stored in
# EPA.09.Ex.15.1.manganese.df.  Here we will create a Q-Q
# plot based on the Kaplan-Meier method.  First we'll assume
# a normal distribution, then a lognormal distribution, then a
# gamma distribution.

EPA.09.Ex.15.1.manganese.df
#   Sample   Well Manganese.Orig.ppb Manganese.ppb Censored
#1       1 Well.1                 <5           5.0     TRUE
#2       2 Well.1               12.1          12.1    FALSE
#3       3 Well.1               16.9          16.9    FALSE
#4       4 Well.1               21.6          21.6    FALSE
#5       5 Well.1                 <2           2.0     TRUE
#...
#21      1 Well.5               17.9          17.9    FALSE
#22      2 Well.5               22.7          22.7    FALSE
#23      3 Well.5                3.3           3.3    FALSE
#24      4 Well.5                8.4           8.4    FALSE
#25      5 Well.5                 <2           2.0     TRUE

# Assume normal distribution
#---------------------------

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored,
prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE,
main = paste("Normal Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions", sep = "\n")))

# Include max value in the plot
#------------------------------

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored,
prob.method = "modified kaplan-meier", points.col = "blue",
main = paste("Normal Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions",
"(Max Included)", sep = "\n")))

# Assume lognormal distribution
#------------------------------

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm",
prob.method = "kaplan-meier", points.col = "blue", add.line = TRUE,
main = paste("Lognormal Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions", sep = "\n")))

# Include max value in the plot
#------------------------------

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored, dist = "lnorm",
prob.method = "modified kaplan-meier", points.col = "blue",
main = paste("Lognormal Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions",
"(Max Included)", sep = "\n")))

# The lognormal distribution appears to be a better fit.
# Now create a Q-Q plot assuming a gamma distribution.  Here we'll
# need to set estimate.params=TRUE.

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored, dist = "gamma",
estimate.params = TRUE, prob.method = "kaplan-meier",
points.col = "blue", add.line = TRUE,
main = paste("Gamma Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions", sep = "\n")))

# Include max value in the plot
#------------------------------

dev.new()
with(EPA.09.Ex.15.1.manganese.df,
qqPlotCensored(Manganese.ppb, Censored, dist = "gamma",
estimate.params = TRUE, prob.method = "modified kaplan-meier",
points.col = "blue", add.line = TRUE,
main = paste("Gamma Q-Q Plot of Manganese Data",
"Based on Kaplan-Meier Plotting Positions",
"(Max Included)", sep = "\n")))

#==========

# Clean up
#---------
graphics.off()


[Package EnvStats version 2.8.1 Index]