boxcoxCensored {EnvStats}  R Documentation 
Box-Cox Power Transformation for Type I Censored Data
Description
Compute the value(s) of an objective function for one or more Box-Cox power transformations, or compute the optimal power transformation based on a specified objective, using Type I censored data.
Usage
boxcoxCensored(x, censored, censoring.side = "left",
  lambda = {if (optimize) c(-2, 2) else seq(-2, 2, by = 0.5)}, optimize = FALSE,
  objective.name = "PPCC", eps = .Machine$double.eps,
  include.x.and.censored = TRUE, prob.method = "michael-schucany",
  plot.pos.con = 0.375)
Arguments
x 
a numeric vector of positive numbers. Missing (NA), undefined (NaN), and infinite (Inf, -Inf) values are allowed but will be removed.
censored 
numeric or logical vector indicating which values of x are censored. This must be the same length as x. If the mode of censored is "logical", TRUE values correspond to elements of x that are censored and FALSE values correspond to elements of x that are not censored.
censoring.side 
character string indicating on which side the censoring occurs. The possible values are "left" (the default) and "right".

lambda 
finite numeric vector indicating what powers to use for the
Box-Cox transformation. When optimize=FALSE, the default value is
lambda=seq(-2, 2, by=0.5). When optimize=TRUE, lambda must be a vector of
length 2 containing the lower and upper bounds of the interval over which to
search for the optimal power (the default is c(-2, 2)).
optimize 
logical scalar indicating whether to simply evaluate the objective function at the
given values of lambda (optimize=FALSE; the default), or to compute the optimal
power transformation within the bounds specified by lambda (optimize=TRUE).
objective.name 
character string indicating what objective to use. The possible values are
"PPCC" (probability plot correlation coefficient; the default), "Shapiro-Wilk"
(Shapiro-Wilk goodness-of-fit statistic), and "Log-Likelihood" (log-likelihood
function).

eps 
finite, positive numeric scalar. When the absolute value of lambda is less than
eps, lambda is assumed to be 0 for the Box-Cox transformation. The default value
is eps=.Machine$double.eps.
include.x.and.censored 
logical scalar indicating whether to include the finite, non-missing values of
the argument x and the corresponding values of censored in the returned object.
The default value is include.x.and.censored=TRUE.
prob.method 
for multiply censored data,
character string indicating what method to use to compute the plotting positions
(empirical probabilities) when objective.name="PPCC".
The default value is prob.method="michael-schucany". This argument is ignored if objective.name is not "PPCC".
plot.pos.con 
for multiply censored data,
numeric scalar between 0 and 1 containing the value of the plotting position
constant when prob.method="michael-schucany" or prob.method="hirsch-stedinger". The default value is plot.pos.con=0.375. This argument is ignored if objective.name is not "PPCC".
Details
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups.
When the original data do not satisfy the above assumptions, data transformations
are often used to attempt to satisfy these assumptions.
Box and Cox (1964) presented a formalized method for deciding on a data
transformation. Given a random variable X
from some distribution with
only positive values, the Box-Cox family of power transformations is defined as:
Y = \frac{X^\lambda - 1}{\lambda} \;\;\;\; \lambda \ne 0
Y = log(X) \;\;\;\; \lambda = 0 \;\;\;\;\;\; (1)

where Y
is assumed to come from a normal distribution. This transformation is
continuous in \lambda
. Note that this transformation also preserves ordering.
See the help file for boxcoxTransform
for more information on data
transformations.
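The transformation in Equation (1) can be written directly in R. This is a minimal sketch for illustration only (bc.transform is a hypothetical helper; boxcoxTransform is the supported EnvStats function):

```r
# Box-Cox power transformation of Equation (1).  For |lambda| below a
# small tolerance the log form is used, mirroring the eps argument.
bc.transform <- function(x, lambda, eps = .Machine$double.eps) {
  stopifnot(all(x > 0))
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

bc.transform(c(1, 2, 4), lambda = 0)    # same as log(c(1, 2, 4))
bc.transform(c(1, 2, 4), lambda = 0.5)  # same as 2 * (sqrt(c(1, 2, 4)) - 1)
```

Because the transformation is strictly increasing in x for every lambda, the ordering of the data is preserved, as noted above.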
Box and Cox (1964) proposed choosing the appropriate value of \lambda
based on
maximizing the likelihood function. Alternatively, an appropriate value of
\lambda
can be chosen based on another objective, such as maximizing the
probability plot correlation coefficient or the ShapiroWilk goodnessoffit
statistic.
Shumway et al. (1989) investigated extending the method of Box and Cox (1964) to the case of Type I censored data, motivated by the desire to produce estimated means and confidence intervals for air monitoring data that included censored values.
In the case when optimize=TRUE
, the function boxcoxCensored
calls the
R function nlminb
to minimize the negative value of the
objective (i.e., maximize the objective) over the range of possible values of
\lambda
specified in the argument lambda
. The starting value for
the optimization is always \lambda=1
(i.e., no transformation).
The next section explains assumptions and notation, and the section after that
explains how the objective is computed for the various options for
objective.name
.
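The optimization step can be mimicked with nlminb directly. The sketch below uses a made-up smooth stand-in objective (obj is hypothetical; boxcoxCensored supplies the real PPCC, Shapiro-Wilk, or log-likelihood objective internally):

```r
# Maximize an objective over lambda in [-2, 2] by minimizing its
# negative, starting from lambda = 1 as boxcoxCensored does.
# obj() is an artificial objective that peaks at lambda = 0.3.
obj <- function(lambda) -(lambda - 0.3)^2
fit <- nlminb(start = 1, objective = function(l) -obj(l),
              lower = -2, upper = 2)
fit$par  # approximately 0.3
```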
Assumptions and Notation
Let \underline{x}
denote a random sample of N
observations from
some continuous distribution. Assume n
(0 < n < N
) of these
observations are known and c
(c = N - n
) of these observations are
all censored below (left-censored) or all censored above (right-censored) at
K
fixed censoring levels
T_1, T_2, \ldots, T_K; \; K \ge 1 \;\;\;\;\;\; (2)
For the case when K \ge 2
, the data are said to be Type I
multiply censored. For the case when K=1
,
set T = T_1
. If the data are left-censored
and all n
known observations are greater
than or equal to T
, or if the data are right-censored and all n
known observations are less than or equal to T
, then the data are
said to be Type I singly censored (Nelson, 1982, p.7), otherwise
they are considered to be Type I multiply censored.
Let c_j
denote the number of observations censored below or above censoring
level T_j
for j = 1, 2, \ldots, K
, so that
\sum_{j=1}^K c_j = c \;\;\;\;\;\; (3)
Let x_{(1)}, x_{(2)}, \ldots, x_{(N)}
denote the “ordered” observations,
where now “observation” means either the actual observation (for uncensored
observations) or the censoring level (for censored observations). For
right-censored data, if a censored observation has the same value as an
uncensored one, the uncensored observation should be placed first.
For left-censored data, if a censored observation has the same value as an
uncensored one, the censored observation should be placed first.
Note that in this case the quantity x_{(i)}
does not necessarily represent
the i
'th “largest” observation from the (unknown) complete sample.
Finally, let \Omega
(omega) denote the set of n
subscripts in the
“ordered” sample that correspond to uncensored observations, and let
\Omega_j
denote the set of c_j
subscripts in the “ordered”
sample that correspond to the censored observations censored at censoring level
T_j
for j = 1, 2, \ldots, K
.
We assume that there exists some value of \lambda
such that the transformed
observations
y_i = \frac{x_i^\lambda - 1}{\lambda} \;\;\;\; \lambda \ne 0
y_i = log(x_i) \;\;\;\; \lambda = 0 \;\;\;\;\;\; (4)

(i = 1, 2, \ldots, n
) form a random sample of Type I censored data from a
normal distribution.
Note that for the censored observations, Equation (4) becomes:
y_{(i)} = T_j^* = \frac{T_j^\lambda - 1}{\lambda} \;\;\;\; \lambda \ne 0
y_{(i)} = T_j^* = log(T_j) \;\;\;\; \lambda = 0 \;\;\;\;\;\; (5)

where i \in \Omega_j
.
Computing the Objective
Objective Based on Probability Plot Correlation Coefficient (objective.name="PPCC"
)
When objective.name="PPCC"
, the objective is computed as the value of the
normal probability plot correlation coefficient based on the transformed data
(see the description of the Probability Plot Correlation Coefficient (PPCC)
goodness-of-fit test in the help file for gofTestCensored
). That is,
the objective is the correlation coefficient for the normal
quantile-quantile plot for the transformed data.
Large values of the PPCC tend to indicate a good fit to a normal distribution.
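For complete (uncensored) samples, the PPCC is simply the correlation between the ordered data and the normal quantiles at the plotting positions. The sketch below illustrates that idea for uncensored data only (ppcc is a hypothetical helper; the censored case instead uses plotting positions computed according to prob.method):

```r
# PPCC for a complete sample, using Blom-type plotting positions with
# constant a = 0.375 (the plot.pos.con default).
ppcc <- function(y, a = 0.375) {
  n <- length(y)
  p <- (seq_len(n) - a) / (n + 1 - 2 * a)   # plotting positions
  cor(sort(y), qnorm(p))                    # correlation on the Q-Q plot
}

set.seed(1)
ppcc(rnorm(100))  # close to 1 for normal data
```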
Objective Based on Shapiro-Wilk Goodness-of-Fit Statistic (objective.name="Shapiro-Wilk"
)
When objective.name="Shapiro-Wilk"
, the objective is computed as the value of
the Shapiro-Wilk goodness-of-fit statistic based on the transformed data
(see the description of the Shapiro-Wilk test in the help file for
gofTestCensored
). Large values of the Shapiro-Wilk statistic tend to
indicate a good fit to a normal distribution.
Objective Based on Log-Likelihood Function (objective.name="Log-Likelihood"
)
When objective.name="Log-Likelihood"
, the objective is computed as the value
of the log-likelihood function. Assuming the transformed observations in
Equation (4) above come from a normal distribution with mean \mu
and
standard deviation \sigma
, we can use the change of variable formula to
write the log-likelihood function as follows.
For Type I left censored data, the log-likelihood function is given by:
log[L(\lambda, \mu, \sigma)] = log[{N \choose c_1 c_2 \ldots c_K n}] + \sum_{j=1}^K c_j log[F(T_j^*)] + \sum_{i \in \Omega} log\{f[y_{(i)}]\} + (\lambda - 1) \sum_{i \in \Omega} log[x_{(i)}] \;\;\;\;\;\; (6)
where f
and F
denote the probability density function (pdf) and
cumulative distribution function (cdf) of the population. That is,
f(t) = \phi(\frac{t-\mu}{\sigma}) \;\;\;\;\;\; (7)
F(t) = \Phi(\frac{t-\mu}{\sigma}) \;\;\;\;\;\; (8)
where \phi
and \Phi
denote the pdf and cdf of the standard normal
distribution, respectively (Shumway et al., 1989). For left singly
censored data, Equation (6) simplifies to:
log[L(\lambda, \mu, \sigma)] = log[{N \choose c}] + c log[F(T^*)] + \sum_{i = c+1}^N log\{f[y_{(i)}]\} + (\lambda - 1) \sum_{i = c+1}^N log[x_{(i)}] \;\;\;\;\;\; (9)
Similarly, for Type I right censored data, the log-likelihood function is given by:
log[L(\lambda, \mu, \sigma)] = log[{N \choose c_1 c_2 \ldots c_K n}] + \sum_{j=1}^K c_j log[1 - F(T_j^*)] + \sum_{i \in \Omega} log\{f[y_{(i)}]\} + (\lambda - 1) \sum_{i \in \Omega} log[x_{(i)}] \;\;\;\;\;\; (10)
and for right singly censored data this simplifies to:
log[L(\lambda, \mu, \sigma)] = log[{N \choose c}] + c log[1 - F(T^*)] + \sum_{i = 1}^n log\{f[y_{(i)}]\} + (\lambda - 1) \sum_{i = 1}^n log[x_{(i)}] \;\;\;\;\;\; (11)
For a fixed value of \lambda
, the log-likelihood function
is maximized by replacing \mu
and \sigma
with their maximum likelihood
estimators (see the section Maximum Likelihood Estimation in the help file
for enormCensored
).
Thus, when optimize=TRUE
, Equation (6) or (10) is maximized by iteratively
solving for \lambda
using the MLEs for \mu
and \sigma
.
When optimize=FALSE
, the value of the objective is computed by using
Equation (6) or (10), using the values of \lambda
specified in the
argument lambda
, and using the MLEs of \mu
and \sigma
.
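As an illustration only, Equation (9) for left singly censored data can be sketched as below, with mu and sigma taken as given rather than replaced by their MLEs (the helper names bc and loglik.left.singly are hypothetical, and dnorm/pnorm carry the standard 1/sigma scaling of the normal density, written here without that factor in Equations (7) and (8)):

```r
# Box-Cox transform of Equations (4) and (5).
bc <- function(x, lambda, eps = .Machine$double.eps)
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda

# Log-likelihood of Equation (9): N observations, c of them
# left-censored at the single censoring level T.
loglik.left.singly <- function(x, censored, T, lambda, mu, sigma) {
  N  <- length(x)
  cc <- sum(censored)
  y      <- bc(x[!censored], lambda)        # transformed uncensored obs
  T.star <- bc(T, lambda)                   # transformed censoring level
  lchoose(N, cc) +
    cc * pnorm(T.star, mu, sigma, log.p = TRUE) +
    sum(dnorm(y, mu, sigma, log = TRUE)) +
    (lambda - 1) * sum(log(x[!censored]))   # Jacobian term
}

x <- c(2, 2, 3, 5, 8)
censored <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
loglik.left.singly(x, censored, T = 2, lambda = 0, mu = 1, sigma = 1)
```

Plugging in the MLEs of mu and sigma for each candidate lambda gives the profile value that boxcoxCensored reports.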
Value
boxcoxCensored
returns a list of class "boxcoxCensored"
containing the results.
See the help file for boxcoxCensored.object
for details.
Note
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity. Johnson and Wichern (2007, p.192) note that "Transformations are nothing more than a re-expression of the data in different units."
Stoline (1991) compared the goodness-of-fit of Box-Cox transformed data (based on
using the “optimal” power transformation from a finite set of values between
-1.5 and 1.5) with log-transformed data for 17 groundwater chemistry variables.
Using the Probability Plot Correlation Coefficient statistic for censored data as a
measure of goodness-of-fit (see gofTest
), Stoline (1991) found that
only 6 of the variables were adequately modeled by a Box-Cox transformation
(p > 0.10 for these 6 variables). Of these variables, five were adequately modeled
by a log transformation. Ten of the variables were “marginally” fit by an
optimal Box-Cox transformation, and of these 10 only 6 were marginally fit by a
log transformation. Based on these results, Stoline (1991) recommends checking
the assumption of lognormality before automatically assuming environmental data fit
a lognormal distribution.
One problem with data transformations is that translating results on the
transformed scale back to the original scale is not always straightforward.
Estimating quantities such as means, variances, and confidence limits in the
transformed scale and then transforming them back to the original scale
usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149;
van Belle et al., 2004, p.400). For example, exponentiating the confidence
limits for a mean based on log-transformed data does not yield a
confidence interval for the mean on the original scale. Instead, this yields
a confidence interval for the median (see the help file for
elnormAltCensored
).
It should be noted, however, that quantiles (percentiles) and rankbased
procedures are invariant to monotonic transformations
(Helsel and Hirsch, 1992, p.12).
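This invariance of quantiles under monotone transformations is easy to check numerically. A quick sketch using the lambda = 0.5 branch of Equation (1) (g is a hypothetical helper; an odd sample size makes the median an actual data point):

```r
# For a strictly increasing transformation, the transform of the
# median equals the median of the transformed data.
set.seed(42)
x <- rlnorm(101)                       # 101 positive observations
g <- function(x) (sqrt(x) - 1) / 0.5   # Box-Cox with lambda = 0.5
all.equal(median(g(x)), g(median(x)))  # TRUE
```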
Finally, there is no guarantee that a Box-Cox transformation based on the
“optimal” value of \lambda
will provide an adequate transformation
to allow the assumption of approximate normality and constant variance. Any
set of transformed data should be inspected relative to the assumptions you
want to make about it (Johnson and Wichern, 2007, p.194).
Author(s)
Steven P. Millard (EnvStats@ProbStatInfo.com)
References
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Cohen, A.C. (1991). Truncated and Censored Samples. Marcel Dekker, New York, New York, pp.50–59.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47–53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, Heagerty, P.J., and Lumley, T. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. PrenticeHall, Upper Saddle River, NJ, Chapter 13.
See Also
boxcoxCensored.object
, plot.boxcoxCensored
,
print.boxcoxCensored
,
boxcox
, Data Transformations, GoodnessofFit Tests.
Examples
# Generate 15 observations from a lognormal distribution with
# mean=10 and cv=2 and censor the observations less than 2.
# Then generate 15 more observations from this distribution and
# censor the observations less than 4.
# Then look at some values of various objectives for various transformations.
# Note that for the PPCC objective the optimal value of lambda is about -0.3,
# whereas for the Log-Likelihood objective it is about 0.3.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(250)
x.1 <- rlnormAlt(15, mean = 10, cv = 2)
censored.1 <- x.1 < 2
x.1[censored.1] <- 2
x.2 <- rlnormAlt(15, mean = 10, cv = 2)
censored.2 <- x.2 < 4
x.2[censored.2] <- 4
x <- c(x.1, x.2)
censored <- c(censored.1, censored.2)
#
# Using the PPCC objective:
#
boxcoxCensored(x, censored)
#Results of Box-Cox Transformation
#Based on Type I Censored Data
#
#
#Objective Name: PPCC
#
#Data: x
#
#Censoring Variable: censored
#
#Censoring Side: left
#
#Censoring Level(s): 2 4
#
#Sample Size: 30
#
#Percent Censored: 26.7%
#
# lambda      PPCC
#   -2.0 0.8954683
#   -1.5 0.9338467
#   -1.0 0.9643680
#   -0.5 0.9812969
#    0.0 0.9776834
#    0.5 0.9471025
#    1.0 0.8901990
#    1.5 0.8187488
#    2.0 0.7480494
boxcoxCensored(x, censored, optimize = TRUE)
#Results of Box-Cox Transformation
#Based on Type I Censored Data
#
#
#Objective Name: PPCC
#
#Data: x
#
#Censoring Variable: censored
#
#Censoring Side: left
#
#Censoring Level(s): 2 4
#
#Sample Size: 30
#
#Percent Censored: 26.7%
#
#Bounds for Optimization: lower = -2
# upper = 2
#
#Optimal Value: lambda = -0.3194799
#
#Value of Objective: PPCC = 0.9827546
#
# Using the Log-Likelihood objective
#
boxcoxCensored(x, censored, objective.name = "Log-Likelihood")
#Results of Box-Cox Transformation
#Based on Type I Censored Data
#
#
#Objective Name: Log-Likelihood
#
#Data: x
#
#Censoring Variable: censored
#
#Censoring Side: left
#
#Censoring Level(s): 2 4
#
#Sample Size: 30
#
#Percent Censored: 26.7%
#
# lambda Log-Likelihood
#   -2.0      -95.38785
#   -1.5      -84.76697
#   -1.0      -75.36204
#   -0.5      -68.12058
#    0.0      -63.98902
#    0.5      -63.56701
#    1.0      -66.92599
#    1.5      -73.61638
#    2.0      -82.87970
boxcoxCensored(x, censored, objective.name = "Log-Likelihood",
  optimize = TRUE)
#Results of Box-Cox Transformation
#Based on Type I Censored Data
#
#
#Objective Name: Log-Likelihood
#
#Data: x
#
#Censoring Variable: censored
#
#Censoring Side: left
#
#Censoring Level(s): 2 4
#
#Sample Size: 30
#
#Percent Censored: 26.7%
#
#Bounds for Optimization: lower = -2
# upper = 2
#
#Optimal Value: lambda = 0.3049744
#
#Value of Objective: Log-Likelihood = -63.2733
#
# Plot the results based on the PPCC objective
#
boxcox.list <- boxcoxCensored(x, censored)
dev.new()
plot(boxcox.list)
#Look at Q-Q Plots for the candidate values of lambda
#
plot(boxcox.list, plot.type = "Q-Q Plots", same.window = FALSE)
#==========
# Clean up
#
rm(x.1, censored.1, x.2, censored.2, x, censored, boxcox.list)
graphics.off()