boxcoxTransform {EnvStats}  R Documentation 
Apply a Box-Cox power transformation to a set of data to attempt to induce normality and homogeneity of variance.
boxcoxTransform(x, lambda, eps = .Machine$double.eps)
x 
a numeric vector of positive numbers. 
lambda 
finite numeric scalar indicating what power to use for the Box-Cox transformation. 
eps 
finite, positive numeric scalar. When the absolute value of 
Two common assumptions for several standard parametric hypothesis tests are:
The observations all come from a normal distribution.
The observations all come from distributions with the same variance.
For example, the standard one-sample t-test assumes all the observations come from the same normal distribution, and the standard two-sample t-test assumes that all the observations come from a normal distribution with the same variance, although the mean may differ between the two groups. For standard linear regression models, these assumptions can be stated as: the error terms all come from a normal distribution with mean 0 and a constant variance.
Often, especially with environmental data, the above assumptions do not hold because the original data are skewed and/or they follow a distribution that is not really shaped like a normal distribution. It is sometimes possible, however, to transform the original data so that the transformed observations in fact come from a normal distribution or close to a normal distribution. The transformation may also induce homogeneity of variance and, for the case of a linear regression model, a linear relationship between the response and predictor variable(s).
Sometimes, theoretical considerations indicate an appropriate transformation. For example, count data often follow a Poisson distribution, and it can be shown that taking the square root of observations from a Poisson distribution tends to make these data look more bell-shaped (Johnson et al., 1992, p.163; Johnson and Wichern, 2007, p.192; Zar, 2010, p.291). A common example in the environmental field is that chemical concentration data often appear to come from a lognormal distribution or some other positively-skewed distribution (e.g., gamma). In this case, taking the logarithm of the observations often appears to yield normally distributed data.
Ideally, a data transformation is chosen based on knowledge of the process generating the data, as well as graphical tools such as quantile-quantile plots and histograms.
Box and Cox (1964) presented a formalized method for deciding on a data transformation. Given a random variable X from some distribution with only positive values, the Box-Cox family of power transformations is defined as:
Y = \begin{cases} \frac{X^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \log(X) & \lambda = 0 \end{cases} \;\;\;\;\;\; (1)

where Y is assumed to come from a normal distribution. This transformation is continuous in \lambda. Note that this transformation also preserves ordering; that is, if X_1 < X_2 then Y_1 < Y_2.
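As a minimal base R sketch of Equation (1) (not the EnvStats source), the hypothetical helper bc below applies the transformation and then checks continuity near \lambda = 0 and order preservation. Treating lambda as 0 when its absolute value is below eps is an assumption made for this sketch, and the data values are arbitrary.

```r
# Hypothetical helper: a base R sketch of Equation (1), not the EnvStats code.
bc <- function(x, lambda, eps = .Machine$double.eps) {
  stopifnot(is.numeric(x), all(x > 0), length(lambda) == 1, is.finite(lambda))
  # Assumption for this sketch: treat lambda as 0 when |lambda| < eps
  if (abs(lambda) < eps) log(x) else (x^lambda - 1) / lambda
}

x <- c(0.5, 1, 2, 10, 50)  # arbitrary positive values

# Continuity in lambda: a tiny nonzero lambda gives nearly the log transform
max(abs(bc(x, 1e-6) - bc(x, 0)))

# Order preservation: transformed values stay strictly increasing
sapply(c(-1, -0.5, 0, 0.5, 2),
       function(l) !is.unsorted(bc(x, l), strictly = TRUE))
```

Dividing by \lambda is what keeps the ordering intact for negative powers: X^\lambda alone is decreasing in X when \lambda < 0.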
Box and Cox (1964) proposed choosing the appropriate value of \lambda based on maximizing a likelihood function. See the help file for boxcox for details.
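The likelihood-based choice can be sketched in base R as follows. This illustrates the profile log-likelihood for Equation (1) under a normal model with a constant mean; it is not the boxcox implementation. The helper names bc and bc_loglik, the simulated data, and the search interval [-2, 2] are all assumptions made for illustration.

```r
# Hypothetical helpers; a sketch, not the EnvStats boxcox() code.
bc <- function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1) / lambda

# Profile log-likelihood of lambda (up to an additive constant), assuming the
# transformed values are i.i.d. normal with unknown mean and variance.
bc_loglik <- function(lambda, x) {
  y <- bc(x, lambda)
  n <- length(x)
  sigma2 <- mean((y - mean(y))^2)  # maximum likelihood variance estimate
  # The second term is the log Jacobian of the transformation
  -n / 2 * log(sigma2) + (lambda - 1) * sum(log(x))
}

set.seed(1)
x <- rlnorm(50, meanlog = 1, sdlog = 0.8)  # simulated right-skewed data

# Maximize over an assumed search interval
fit <- optimize(bc_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
fit$maximum    # estimated lambda
fit$objective  # profile log-likelihood at the estimate
```

In practice the estimate is often rounded to a nearby interpretable power (for example 0 or 0.5) when the likelihood is flat around the maximum.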
Note that for nonzero values of \lambda, instead of using the formula of Box and Cox in Equation (1), you may simply use the power transformation:
Y = X^\lambda \;\;\;\;\;\; (2)
since these two equations differ only by a scale difference and origin shift, and the essential character of the transformed distribution remains unchanged.
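A short base R check of this relationship, using arbitrary illustrative data: Equation (2) is an affine function of Equation (1), so the standardized values (and therefore the shape seen in a normal Q-Q plot) are identical for a positive power.

```r
x <- c(0.3, 1.2, 2.5, 4, 9, 20)  # arbitrary positive values
lambda <- 0.5

y1 <- (x^lambda - 1) / lambda  # Equation (1)
y2 <- x^lambda                 # Equation (2)

# Equation (2) is an affine function of Equation (1): y2 = lambda * y1 + 1
all.equal(y2, lambda * y1 + 1)

# Standardized values coincide, so the distributional shape is the same
all.equal(as.vector(scale(y1)), as.vector(scale(y2)))

# With a negative lambda the slope lambda is negative, so Equation (2)
# reverses the ordering that Equation (1) preserves
cor(x^(-1), (x^(-1) - 1) / (-1))
```

The last line is worth keeping in mind when reading plots of X^\lambda for negative \lambda: large original values become small transformed values.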
The value \lambda=1 corresponds to no transformation. Values of \lambda less than 1 shrink large values of X, and are therefore useful for transforming positively-skewed (right-skewed) data. Values of \lambda larger than 1 inflate large values of X, and are therefore useful for transforming negatively-skewed (left-skewed) data (Helsel and Hirsch, 1992, pp.13–14; Johnson and Wichern, 2007, p.193). Commonly used values of \lambda include 0 (log transformation), 0.5 (square-root transformation), -1 (reciprocal), and -0.5 (reciprocal root).
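The effect of \lambda on skewness can be illustrated with a small base R sketch. The helpers bc and skewness and the deterministic sample (lognormal quantiles at evenly spaced probabilities) are assumptions chosen for illustration; real data will not behave this cleanly.

```r
# Hypothetical helpers for illustration
bc <- function(x, lambda) if (lambda == 0) log(x) else (x^lambda - 1) / lambda
skewness <- function(y) mean((y - mean(y))^3) / mean((y - mean(y))^2)^1.5

# Deterministic right-skewed sample: lognormal quantiles at evenly spaced
# probabilities, so the log-transformed values are symmetric
x <- qlnorm(ppoints(100), meanlog = 0, sdlog = 1)

# Moment-based sample skewness after each of the commonly used powers
lambdas <- c(-1, -0.5, 0, 0.5, 1)
skews <- sapply(lambdas, function(l) skewness(bc(x, l)))
round(setNames(skews, lambdas), 3)
```

For this sample the skewness rises with \lambda: it is positive for the untransformed data (\lambda = 1), essentially zero for the log transformation, and negative for the reciprocal, which over-corrects.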
It is often recommended that when dealing with several similar data sets, it is best to find a common transformation that works reasonably well for all the data sets, rather than using slightly different transformations for each data set (Helsel and Hirsch, 1992, p.14; Shumway et al., 1989).
numeric vector of transformed observations.
Data transformations are often used to induce normality, homoscedasticity, and/or linearity, common assumptions of parametric statistical tests and estimation procedures. Transformations are not “tricks” used by the data analyst to hide what is going on, but rather useful tools for understanding and dealing with data (Berthouex and Brown, 2002, p.61). Hoaglin (1988) discusses “hidden” transformations that are used every day, such as the pH scale for measuring acidity.
In the case of a linear model, there are at least two approaches to improving a model fit: transform the Y and/or X variable(s), and/or use more predictor variables. Often in environmental data analysis, we assume the observations come from a lognormal distribution and automatically take logarithms of the data. For a simple linear regression (i.e., one predictor variable), if regression diagnostic plots indicate that a straight line fit is not adequate, but that the variance of the errors appears to be fairly constant, you may only need to transform the predictor variable X or perhaps use a quadratic or cubic model in X. On the other hand, if the diagnostic plots indicate that the constant variance and/or normality assumptions are suspect, you probably need to consider transforming the response variable Y. Data transformations for linear regression models are discussed in Draper and Smith (1998, Chapter 13) and Helsel and Hirsch (1992, pp. 228–229).
One problem with data transformations is that translating results on the transformed scale back to the original scale is not always straightforward. Estimating quantities such as means, variances, and confidence limits in the transformed scale and then transforming them back to the original scale usually leads to biased and inconsistent estimates (Gilbert, 1987, p.149; van Belle et al., 2004, p.400). For example, exponentiating the confidence limits for a mean based on log-transformed data does not yield a confidence interval for the mean on the original scale. Instead, this yields a confidence interval for the median (see the help file for elnormAlt). It should be noted, however, that quantiles (percentiles) and rank-based procedures are invariant to monotonic transformations (Helsel and Hirsch, 1992, p.12).
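Both points can be illustrated with a short base R sketch on simulated lognormal data. The sample size, parameters, and seed are arbitrary assumptions; the quantile check uses type = 1 so that each quantile is an actual order statistic rather than an interpolated value.

```r
set.seed(42)
x <- rlnorm(25, meanlog = 2, sdlog = 1)  # simulated skewed sample

# Back-transforming the mean of the logs gives the geometric mean, which
# estimates the lognormal median exp(meanlog), not the mean
exp(mean(log(x)))
mean(x)

# Back-transformed t-interval limits: an interval for the median
ci_log <- t.test(log(x))$conf.int
exp(ci_log)

# Quantiles that are order statistics (type = 1) and ranks are unchanged
# by the monotone log transformation
p <- c(0.25, 0.5, 0.9)
all.equal(log(quantile(x, p, type = 1)), quantile(log(x), p, type = 1))
identical(rank(x), rank(log(x)))
```

The geometric mean can never exceed the arithmetic mean, which is one way to see why the back-transformed interval sits too low to serve as an interval for the mean of skewed data.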
Steven P. Millard (EnvStats@ProbStatInfo.com)
Berthouex, P.M., and L.C. Brown. (2002). Statistics for Environmental Engineers, Second Edition. Lewis Publishers, Boca Raton, FL.
Box, G.E.P., and D.R. Cox. (1964). An Analysis of Transformations (with Discussion). Journal of the Royal Statistical Society, Series B 26(2), 211–252.
Draper, N., and H. Smith. (1998). Applied Regression Analysis. Third Edition. John Wiley and Sons, New York, pp.47–53.
Gilbert, R.O. (1987). Statistical Methods for Environmental Pollution Monitoring. Van Nostrand Reinhold, NY.
Helsel, D.R., and R.M. Hirsch. (1992). Statistical Methods in Water Resources Research. Elsevier, New York, NY.
Hinkley, D.V., and G. Runger. (1984). The Analysis of Transformed Data (with Discussion). Journal of the American Statistical Association 79, 302–320.
Hoaglin, D.C., F.M. Mosteller, and J.W. Tukey, eds. (1983). Understanding Robust and Exploratory Data Analysis. John Wiley and Sons, New York, Chapter 4.
Hoaglin, D.C. (1988). Transformations in Everyday Experience. Chance 1, 40–45.
Johnson, N. L., S. Kotz, and A.W. Kemp. (1992). Univariate Discrete Distributions, Second Edition. John Wiley and Sons, New York, p.163.
Johnson, R.A., and D.W. Wichern. (2007). Applied Multivariate Statistical Analysis, Sixth Edition. Pearson Prentice Hall, Upper Saddle River, NJ, pp.192–195.
Shumway, R.H., A.S. Azari, and P. Johnson. (1989). Estimating Mean Concentrations Under Transformations for Environmental Data With Detection Limits. Technometrics 31(3), 347–356.
Stoline, M.R. (1991). An Examination of the Lognormal and Box and Cox Family of Transformations in Fitting Environmental Data. Environmetrics 2(1), 85–106.
van Belle, G., L.D. Fisher, P.J. Heagerty, and T. Lumley. (2004). Biostatistics: A Methodology for the Health Sciences, 2nd Edition. John Wiley & Sons, New York.
Zar, J.H. (2010). Biostatistical Analysis. Fifth Edition. Prentice-Hall, Upper Saddle River, NJ, Chapter 13.
boxcox, Data Transformations, Goodness-of-Fit Tests.
# Generate 30 observations from a lognormal distribution with
# mean=10 and cv=2, then look at some normal quantile-quantile
# plots for various transformations.
# (Note: the call to set.seed simply allows you to reproduce this example.)
set.seed(250)
x <- rlnormAlt(30, mean = 10, cv = 2)
dev.new()
qqPlot(x, add.line = TRUE)
dev.new()
qqPlot(boxcoxTransform(x, lambda = 0.5), add.line = TRUE)
dev.new()
qqPlot(boxcoxTransform(x, lambda = 0), add.line = TRUE)
# Clean up
#
rm(x)