R: Box-Cox power transformation and its inverse

BoxCox {Ecfun}

R Documentation

Box-Cox power transformation and its inverse

Description

Box and Cox (1964) considered the following family of transformations indexed by lambda:

w = (y^lambda-1)/lambda

= expm1(lambda*log(y))/lambda,

with the lambda=0 case defined as log(y) to make w continuous in lambda for constant y.

They estimate lambda assuming w follows a normal distribution. This raises a theoretical problem in that y must be positive, which means that w must follow a truncated normal distribution conditioned on lambda*w > (-1).

Bickel and Doksum (1981) removed the restriction to positive y, i.e., to w > (-1/lambda) by modifying the transformation as follows:

w =

(sgn(y)*abs(y)^lambda-1)/lambda if lambda != 0 and

sgn(y)*log(abs(y)) if lambda = 0,

where sgn(y) = 1 if y >= 0 and -1 otherwise.

NOTE: sgn(y) is different from sign(y), which is 0 for y = 0. A two-argument update to the sign function in the base package has been added to this Ecfun package, so sign(y, 1) = sgn(y).

If (y<0), this transformation is discontinuous at lambda = 0. To see this, we rewrite this as

w = (sgn(y)*expm1(lambda*log(abs(y))) + (sgn(y)-1)) / lambda

= sgn(y)*(log(abs(y)) + O(lambda) + (sgn(y)-1)/lambda,

where O(lambda) indicates a term that is dominated by a constant times lambda.

If y<0, this latter term (sgn(y)-1)/lambda = (-2)/lambda and becomes Inf as lambda -> 0.

In practice, we assume that y > 0, so this distinction has little practical value. However, the BoxCox function computes the Bickel-Doksum version.

Box and Cox further noted that proper estimation of lambda should include the Jacobian of the transformation in the log(likelihood). Doing this can be achieved by rescaling the transformation with the nth root of the Jacobian, which can be written as follows:

j(y, lambda) = J(y, lambda)^(1/n) = GeometricMean(y)^(lambda-1).

With this the rescaled power transformation is as follows:

z = (y^lambda-1) / (lambda*j(y, lambda) if lambda!=0 or GeometricMean(y)*log(y) if lambda==0.

In addition to facilitating estimation of lambda, rescaling has the advantage that the units of z are the same as the units of y.

The output has class BoxCox, which has attributes that allow the input to be recovered using invBoxCox. The default values of the arguments of invBoxCox are provided by the corresponding attributes of z.

Usage

  BoxCox(y, lambda, rescale=TRUE, na.rm=rescale) 
  invBoxCox(z, lambda, sign.y, GeometricMean, rescale)

Arguments

`y`	a numeric vector for which the power transform is desired
`lambda`	A numeric vector of length 1 or 2. The first component is the power. If the second component is provided, `y` is replaced by `y+lambda[2]`.
`rescale`	logical or numeric. If logical: For `BoxCox`, this is `TRUE` to return the power transform with rescale, `z`, above, and `FALSE` to return the power transform without the `n`th root of the Jacobian, `w`, above. This defaults to `TRUE`, because this will give `z` the same units as `y`. For `invBoxCox`, this is `TRUE` if the input argument `z` is assumed to have been rescaled by the `n`th root of the Jacobian of the transformation. This defaults to a `rescale` attribute of `z` if present or to `TRUE` if absent. If numeric, it is assumed to be the geometric mean of another set of y values to use with new `y`'s.
`na.rm`	logical: `TRUE` to remove `NA`s from `y` before computing the geometric mean. `FALSE` to compute `NA` for the geometric mean if `any(is.na(y))`. NOTE: If `na.rm` = `FALSE`, the output will be all `NA` if `rescale` = `TRUE`. This could produce non usable answers in most cases. To avoid that, the default for `na.rm` is `TRUE` whenever `rescale` = `TRUE`. Conversely, applications using `na.rm` = `FALSE` will likely also want `rescale` = `FALSE` to avoid returning a non-answer in these cases. This explains the default `na.rm` = `rescale`.
`z`	a numeric vector or an object of class `BoxCox` for which the inverse Box-Cox transform is desired.
`sign.y`	an optional logical vector giving `sign(y-lambda[2])` of the data values that presumably generated `z`. Defaults to an `sign.y` attribute of `z` or to `rep(1, length(z))` if no such attribute is present.
`GeometricMean`	an optional numeric scalar giving the geometric mean of the data values that presumably generated `z`. Defaults to a `GeometricMean` attribute of `z` or to 1 if no such attribute is present.

Details

Box and Cox (1964) discussed

w(y, lambda) = (y^lambda - 1)/lambda.

They noted that w is continuous in lambda with w(y, lambda) = log(y) if lambda = 0 (by l'Hopital's rule).

They also discussed

z(y, lambda) = (y^lambda - 1) / (lambda*g^(lambda-1)),

where g = the geometric mean of y.

They noted that proper estimation of lambda should include the Jacobian of w(y, lambda) with the likelihood. They further showed that a naive normal likelihood using z(y, lambda) as the response without a Jacobian is equivalent to the normal likelihood using w(y, lambda) adjusted appropriately using the Jacobian. See Box and Cox (1964) or the Wikipedia article on "Power transform".

Bickel and Doksum (1981) suggested adding sign(y) to the transformation, as discussed above.

NUMERICAL ANALYSIS:

Consider the Bickel and Doksum version described above:

w <- (sign(y)*abs(y)^lambda-1)/lambda

if(any(y==0)), GeometricMean(y) = 0. This creates a problem with the above math.

Let ly = log(abs(y)). Then with la = lambda,

w = (sign(y)*exp(la*ly)-1)/la

= sign(y) * ly * (1+(la*ly/2) * (1+(la*ly/3)*(1+(la*ly/4)*(1+O(la*ly))))) + (sign(y)-1)/la

For y>0, the last term is zero. boxcox ignores cases with y<=0 and uses this formula (ignoring the final O(la*ly)) whenever abs(la) <= eps = 1/50. That form is used here also.

For invBoxCox a complementary analysis is as follows:

abs(y+lambda[2]) = abs(1+la*w)^(1/la)

= exp(log1p(la*w)/la) for abs(la*w)<1

= w * (1-la*w * ((1/2)-la*w * ((1/3)-la*w*(1/4-...))))

Value

BoxCox returns an object of class BoxCox, being a numeric vector of the same length as y with the following optional attributes:

lambda the value of lambda used in the transformation
sign.y sign(y) (or sign(y-lambda[2]) lambda[2] is provided and if any of these quantities are negative. Otherwise, this is omitted and all are assumed to be positive.
rescale logical: TRUE if z(y, lambda) is returned rescaled by g^(lambda-1) with g = the geometric mean of y

and FALSE if z(y, lambda) is not so rescaled.
GeometricMean If rescale is numeric, attr(., 'GeometricMean') <- rescale.

Otherwise, attr(., 'GeometricMean') is the Geometric mean of abs(y) = exp(mean(log(abs(y)))) or of abs(y+lambda[2]) if(length(lambda)>1).

invBoxCox returns a numeric vector, reconstructing y from BoxCox(y, ...).

Source

Bickel, Peter J., and Doksum, Kjell A. (1981) "An analysis of transformation revisited", Journal of the American Statistical Association, 76 (374): 296-311

Box, George E. P.; Cox, D. R. (1964). "An analysis of transformations", Journal of the Royal Statistical Society, Series B 26 (2): 211-252.

Box, George E. P.; Cox, D. R. (1982). "An analysis of transformations revisited, rebutted", Journal of the American Statistical Association, 77(377): 209-210.

References

Wikipedia, "Power transform"

Examples

##
## 1.  A simple example to check the two algorithms 
##
Days <- 0:9
bc1 <- BoxCox(Days, c(0.01, 1))
# Taylor expansion used for obs 1:7; expm1 for 8:10 

# check 
GM <- exp(mean(log(abs(Days+1))))

bc0 <- (((Days+1)^0.01)-1)/0.01
bc1. <- (bc0 / (GM^(0.01-1)))  
# log(Days+1) ranges from 0 to 4.4 
# lambda = 0.01 will invoke both the obvious
# algorithm and the alternative assumed to be 
# more accurate for (lambda(log(y)) < 0.02).  
attr(bc1., 'lambda') <- c(0.01, 1)  
attr(bc1., 'rescale') <- TRUE 
attr(bc1., 'GeometricMean') <- GM 
class(bc1.) <- 'BoxCox'


all.equal(bc1, bc1.)


##
## 2.  another simple example with lambda=0
##
bc0.4 <- BoxCox(1:5, 0)
GM5 <- prod(1:5)^.2
bc0.4. <- log(1:5)*GM5
attr(bc0.4., 'lambda') <- 0  
attr(bc0.4., 'rescale') <- TRUE 
attr(bc0.4., 'GeometricMean') <- GM5 
class(bc0.4.) <- 'BoxCox'


all.equal(bc0.4, bc0.4.)


bc0.4e9 <- BoxCox(1:5, .Machine$double.eps)
bc0.4ex <- log(1:5)*exp(mean(log(1:5)))

all.equal(bc0.4ex, as.numeric(bc0.4e9))


# now invert:  

bc0.4i <- invBoxCox(bc0.4.)

all.equal(1:5, bc0.4i)



all.equal(1:5, invBoxCox(bc0.4e9))


##
## 3.  The "boxcox" function in the MASS package 
##     computes a maximum likelihood estimate with  
##     BoxCox(Days+1, lambda=0.21) 
##     with a 95 percent confidence interval of 
##     approximately (0.08, 0.35)
##
bcDays1 <- BoxCox(MASS::quine$Days, c(0.21, 1))

# check 
GeoMean <- exp(mean(log(abs(MASS::quine$Days+1))))

bcDays1. <- ((((MASS::quine$Days+1)^0.21)-1) / 
               (0.21*GeoMean^(0.21-1)))  
# log(Days+1) ranges from 0 to 4.4 
attr(bcDays1., 'lambda') <- c(0.21, 1)  
attr(bcDays1., 'rescale') <- TRUE 
attr(bcDays1., 'GeometricMean') <- GeoMean 
class(bcDays1.) <- 'BoxCox'


all.equal(bcDays1, bcDays1.)


iDays <- invBoxCox(bcDays1)

all.equal(iDays, MASS::quine$Days)


##
## 4.  Easily computed example 
##
bc2 <- BoxCox(c(1, 4), 2)

# check 
bc2. <- (c(1, 4)^2-1)/4
attr(bc2., 'lambda') <- 2
attr(bc2., 'rescale') <- TRUE 
attr(bc2., 'GeometricMean') <- 2 
class(bc2.) <- 'BoxCox'


all.equal(bc2, bc2.)



all.equal(invBoxCox(bc2), c(1, 4))


##
## 5.  plot(BoxCox())
##
y0 <- seq(-2, 2, .1)
z2 <- BoxCox(y0, 2, rescale=FALSE)
plot(y0, z2)

# check 
z2. <- (sign(y0)*y0^2-1)/2

attr(z2., 'lambda') <- 2
attr(z2., 'sign.y') <- sign(y0, 1)
attr(z2., 'rescale') <- FALSE 
attr(z2., 'GeometricMean') <- 0
class(z2.) <- 'BoxCox'


all.equal(z2, z2.)



all.equal(invBoxCox(z2), y0)

[Package Ecfun version 0.3-2 Index]