orderNorm {bestNormalize} | R Documentation |
Calculate and perform Ordered Quantile normalizing transformation
Description
The Ordered Quantile (ORQ) normalization transformation, orderNorm(), is a rank-based procedure by which the values of a
vector are mapped to their percentile, which is then mapped to the same
percentile of the normal distribution. In the absence of ties, the
percentiles are uniformly distributed, which essentially guarantees that the
transformed values follow a normal distribution.
The transformation is:
g(x) = \Phi^{-1}((rank(x) - 0.5) / length(x))
where \Phi refers to the standard normal CDF (so \Phi^{-1} is the standard normal quantile function), rank(x) refers to each
observation's rank, and length(x) refers to the number of observations.
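The formula above can be reproduced by hand in base R. The following is a minimal sketch, assuming no ties and no missing values; it is not the package's internal code, and the variable names are illustrative only.

```r
set.seed(42)
x <- rgamma(100, shape = 1, rate = 1)

# Percentile of each observation: (rank - 0.5) / n lies strictly inside (0, 1)
q <- (rank(x) - 0.5) / length(x)

# Map each percentile to the matching standard normal quantile (inverse CDF)
z <- qnorm(q)

# The transformation is monotone, so the ordering of x is preserved
cor(x, z, method = "spearman")
```

Subtracting 0.5 from the ranks keeps every percentile away from 0 and 1, where qnorm() would return -Inf or Inf.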
By itself, this method is certainly not new; the earliest mention of it that I could find is in a 1947 paper by Bartlett (see references). The formula was outlined explicitly in Van der Waerden (1952) and expounded upon in Beasley (2009). However, there is a key difference in this version of it, as explained below.
Using linear interpolation between these percentiles, the ORQ normalization becomes a 1-1 transformation that can be applied to new data. However, outside of the observed domain of x, it is unclear how to extrapolate the transformation. In the ORQ normalization procedure, a binomial GLM with a logit link is fit to the ranks in order to extrapolate beyond the bounds of the original domain of x. The inverse normal CDF is then applied to these extrapolated predictions in order to extrapolate the transformation. This mitigates the influence of heavy-tailed distributions while preserving the 1-1 nature of the transformation. The extrapolation will issue a warning unless warn = FALSE. Nevertheless, we found that the extrapolation was able to perform very well even on data as heavy-tailed as a Cauchy distribution (paper to be published).
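The interpolation and extrapolation steps described above can be sketched in base R. This is a simplified illustration under stated assumptions, not the package's implementation: the helper orq_sketch() is hypothetical, the logit fit uses all observations, and no adjustment is made to keep the curve continuous at the boundaries of the observed range.

```r
set.seed(1)
x <- rgamma(200, shape = 1, rate = 1)
n <- length(x)

# Percentiles and normal scores on the original data
q <- (rank(x) - 0.5) / n
z <- qnorm(q)

# Binomial GLM with a logit link fit to the percentiles.
# The implied "successes" (q * n) are non-integer, which makes glm() warn,
# hence the suppressWarnings() wrapper.
fit <- suppressWarnings(
  glm(q ~ x, family = binomial(link = "logit"), weights = rep(n, n))
)

# Hypothetical helper: transform new values using the two regimes
orq_sketch <- function(new_x) {
  inside <- new_x >= min(x) & new_x <= max(x)
  out <- numeric(length(new_x))
  if (any(inside)) {
    # Inside the observed range: linear interpolation between observed points
    out[inside] <- approx(x, z, xout = new_x[inside])$y
  }
  if (any(!inside)) {
    # Outside: predicted percentile from the logit fit, then inverse normal CDF
    p_hat <- predict(fit, newdata = data.frame(x = new_x[!inside]),
                     type = "response")
    out[!inside] <- qnorm(p_hat)
  }
  out
}

new_x <- c(min(x) - 0.5, median(x), max(x) + 1)
z_new <- orq_sketch(new_x)
z_new
```

Because the logistic curve approaches 0 and 1 only asymptotically, the extrapolated percentiles stay inside (0, 1) and the resulting normal scores remain finite for moderately extreme inputs.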
The fit used to perform the extrapolation uses a default of 10000 observations (or length(x) if that is less); this number is set by n_logit_fit. This added approximation improves scalability, both computationally and in terms of memory used. Do not set this value too low (e.g. <100), as there is no benefit to doing so. Increase it if your test data set is large relative to 10000 and/or if you are worried about losing signal in the extremes of the range.
This transformation can be performed on new data and inverted via the predict function.
Usage
orderNorm(x, n_logit_fit = min(length(x), 10000), ..., warn = TRUE)
## S3 method for class 'orderNorm'
predict(object, newdata = NULL, inverse = FALSE, warn = TRUE, ...)
## S3 method for class 'orderNorm'
print(x, ...)
Arguments
x | A vector to normalize |
n_logit_fit | Number of points used to fit the logit approximation |
... | Additional arguments |
warn | If TRUE, transforms outside the observed range or ties will yield a warning |
object | An object of class 'orderNorm' |
newdata | A vector of data to be (reverse) transformed |
inverse | If TRUE, performs the reverse transformation |
Value
A list of class orderNorm with elements
x.t | transformed original data |
x | original data |
n | number of nonmissing observations |
ties_status | indicator if ties are present |
fit | fit to be used for extrapolation, if needed |
norm_stat | Pearson's P / degrees of freedom |
The predict function returns the numeric value of the transformation
performed on new data, and allows for the inverse transformation as well.
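As a brief illustration of predict on new data, the sketch below uses only the arguments listed under Usage. It assumes the bestNormalize package is installed; the object names (fit, new_x, z_new, x_back) are illustrative.

```r
library(bestNormalize)

set.seed(100)
x <- rgamma(100, 1, 1)
fit <- orderNorm(x)

# New values, two of which fall outside the observed range of x
new_x <- c(min(x) / 2, median(x), max(x) + 1)

# warn = FALSE silences the warning about transforming outside the observed range
z_new <- predict(fit, newdata = new_x, warn = FALSE)

# Map the transformed values back toward the original scale
x_back <- predict(fit, newdata = z_new, inverse = TRUE, warn = FALSE)
```

Leaving warn at its default of TRUE is usually the safer choice during exploration, since the warning flags values that rely on the extrapolation rather than on interpolation.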
References
Bartlett, M. S. "The Use of Transformations." Biometrics, vol. 3, no. 1, 1947, pp. 39-52. JSTOR www.jstor.org/stable/3001536.
Van der Waerden BL. Order tests for the two-sample problem and their power. 1952;55:453-458. Ser A.
Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav. Genet. 2009;39(5): 580-595. pmid:19526352
See Also
boxcox, lambert, bestNormalize, yeojohnson
Examples
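# Simulate right-skewed data
x <- rgamma(100, 1, 1)
# Fit the ORQ transformation and print the fitted object
orderNorm_obj <- orderNorm(x)
orderNorm_obj
# Transform the original data, then invert the transformation
p <- predict(orderNorm_obj)
x2 <- predict(orderNorm_obj, newdata = p, inverse = TRUE)
# The round trip should recover the original values
all.equal(x2, x)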