R: Error propagation using different methods

propagate {qpcR}

R Documentation

Error propagation using different methods

Description

A general function for the calculation of error propagation by Monte Carlo simulation, permutation and first/second-order Taylor expansion including covariances. Can be used for qPCR data, but any data that should be subjected to error propagation analysis will do. The different methods can be used for any expression based on either replicate or summary data (mean & standard deviation).

Usage

propagate(expr, data, type = c("raw", "stat"), second.order = TRUE,
          do.sim = FALSE, dist.sim = c("norm", "tnorm"), use.cov = FALSE, 
          nsim = 10000, do.perm = FALSE, perm.crit = NULL, ties = NULL, 
          nperm = 2000, alpha = 0.05, plot = TRUE, logx = FALSE, 
          verbose = FALSE, ...)

Arguments

`expr`	an expression, such as `expression(x/y)`.
`data`	a dataframe or matrix containing either a) the replicates in columns or b) the means in the first row and the standard deviations in the second row. The variable names must be defined in the column headers.
`type`	either `raw` if replicates are given, or `stat` if means and standard deviations are supplied.
`second.order`	logical. If `TRUE`, error propagation will be calculated with first AND second-order Taylor expansion. See 'Details'.
`do.sim`	logical. Should Monte Carlo simulation be applied?
`dist.sim`	`"norm"` will use a multivariate normal distribution, `"tnorm"` a multivariate truncated normal distribution. See 'Details'.
`use.cov`	logical or variance-covariance matrix with the same column descriptions and column order as `data`. If `TRUE` together with replicates, the covariances are calculated from these and used within Monte Carlo simulation and error propagation. If `type = "stat"`, a square variance-covariance matrix can be supplied in the right dimensions (n x n, n = number of variables). If `FALSE`, Monte Carlo simulation and error propagation use only the diagonal (variances).
`nsim`	the number of simulations to be performed, minimum is 5000.
`do.perm`	logical. Should permutation error analysis be applied?
`perm.crit`	a character string of one or more criteria defining the null hypothesis for the permutation p-value. See 'Details'.
`ties`	a vector defining the columns that should be tied together for the permutations. See 'Details'.
`nperm`	the number of permutations to be performed.
`alpha`	the confidence level.
`plot`	logical. Should histograms with confidence intervals (in blue) be plotted for all methods?
`logx`	logical. Should the x-axis of the graphs have logarithmic scale?
`verbose`	logical. If `TRUE`, a longer output is given including the simulated data, derivatives, covariance matrix, etc.
`...`	other parameters to be supplied to `hist`, `boxplot` or `abline`.

Details

The implemented methods are:

1) Monte Carlo simulation:
For each variable in data, simulated data with nsim samples is generated from a multivariate (truncated) normal distribution using mean \mu and standard deviation \sigma of each variable. All data is coerced into a new dataset that has the same covariance structure as the initial data. Each row of the simulated dataset is evaluated and summary statistics are calculated. In scenarios that are nonlinear in nature the distribution of the result values can be skewed, mainly due to the simulated values at the extreme end of the normal distribution. Setting dist.sim = "tnorm" will fit a multivariate normal distribution, calculate the lower/upper 2.5% quantile on each side for each input variable and use these as bounds for simulating from a multivariate truncated normal distribution. This will (in part) remove some of the skewness in the result distribution.

2) Permutation approach:
The original data is permutated nperm times by binding observations together according to ties. The ties bind observations that can be independent measurements from the same sample. In qPCR terms, this would be a real-time PCR for two different genes on the same sample. If ties are omitted, the observations are shuffled independently. In detail, two datasets are created for each permutation: Dataset1 samples the rows (replicates) of the data according to ties. Dataset2 is obtained by sampling the columns (samples), also binding columns as defined in ties. For both datasets, the permutations are evaluated and statistics are collected. A confidence interval is calculated from all evaluations of Dataset1. A p-value is calculated from all permutations that follow perm.crit, whereby init reflects the permutations of the initial data and perm the permutations of the randomly reallocated samples. Thus, the p-value gives a measure against the null hypothesis that the result in the initial group is just by chance. See also 'Examples'.
The criterion for the permutation p-value (perm.crit) has to be defined by the user. For example, let's say we calculate some value 0.2 which is a ratio between two groups. We would hypothesize that by randomly reallocating the values between the groups, the mean values are not equal or smaller than in the initial data. We would thus define perm.crit as "perm < init" meaning that we want to test if the mean of the initial data (init) is frequently smaller than by the randomly allocated data (perm). The default (NULL) is to test all three variants "perm > init", "perm == init" and "perm < init".

3) Error propagation:
The propagated error is calculated by first and second-order Taylor expansion using matrix algebra. Often omitted, but important in models where the variables are correlated, is the second covariance term:

\sigma_y^2 = \underbrace{\sum_{i=1}^N\left(\frac{\partial f}{\partial x_i} \right)^2 \sigma_i^2}_{\rm{variance}} + \underbrace{2\sum_{i=1\atop i \neq j}^N\sum_{j=1\atop j \neq i}^N\left(\frac{\partial f}{\partial x_i} \right)\left(\frac{\partial f}{\partial x_j} \right) \sigma_{ij}}_{\rm{covariance}}

propagate calculates the propagated error either with or without covariances using matrix algebra for first- and second-order (since version 1.3-8) Taylor expansion.
First-order:
\mathrm{E}(y) = f(\bar{x}_i)
\mathbf{C}_y = \nabla_x\mathbf{C}_x\nabla_x^T
Second-order:
\mathrm{E}(y) = f(\bar{x}_i) + \frac{1}{2}[tr(\mathbf{H}_{xx}\mathbf{C}_x)]
\mathbf{C}_y = \nabla_x\mathbf{C}_x\nabla_x^T + \frac{1}{2}[tr(\mathbf{H}_{xx}\mathbf{C}_x\mathbf{H}_{xx}\mathbf{C}_x)]

with \mathrm{E}(y) = expectation of y, \mathbf{C}_y = variance of y, \nabla_x = the p x n gradient matrix with all partial first derivatives, \mathbf{C}_x = the p x p covariance matrix, \mathbf{H}_{xx} the Hessian matrix with all partial second derivatives and tr(\cdot) = the trace (sum of diagonal) of the matrix. For a detailed derivation, see 'References'.
The second-order Taylor expansion (second.order = TRUE) corrects for bias in nonlinear expressions as the first-order Taylor expansion assumes linearity around \bar{x}_i.
Depending on the input formula, the error propagation may result in an error that is not normally distributed. The Monte Carlo simulation, starting with normal distributions of the variables, can clarify this. A high tendency from deviation of normality is encountered in formulas in which the error of the denominator is relatively high or in exponential models where the exponent has a high error. This is one of the problems that is inherent in real-time PCR analysis, as the classical ratio calculation with efficiencies (i.e. by the delta-ct method) is of an exponential type.

Value

A plot containing histograms of the Monte Carlo simulation, the permutation values and error propagation. Additionally inserted are a boxplot, median values in red and confidence intervals as blue borders.

A list with the following components (if verbose = TRUE):

`data.Sim`	the Monte Carlo simulated data with evaluations in the last column.
`data.Perm`	the data of the permutated observations and samples with corresponding evaluations and the decision according to `perm.crit`.
`data.Prop`	`nsim` values generated from a normal distribution with mean and s.d. as calculated from the propagated error.
`gradient`	the evaluated gradient vector `\nabla_x` of partial first derivatives.
`hessian`	the evaluated hessian matrix `\mathbf{H}_{xx}` of partial second derivatives.
`covMat`	the covariance matrix `\mathbf{C}_x` used for Monte Carlo simulation and error propagation.
`summary`	a summary of the collected statistics, given as a dataframe. These are: mean, s.d. median, mad, lower/upper confidence interval and permutation p-values.

Author(s)

Andrej-Nikolai Spiess

References

Error propagation (in general):
An Introduction to error analysis.
Taylor JR.
University Science Books (1996), New York.

Evaluation of measurement data - Guide to the expression of uncertainty in measurement.
JCGM 100:2008 (GUM 1995 with minor corrections).
https://www.bipm.org/utils/common/documents/jcgm/JCGM_100_2008_E.pdf.

Higher-order Taylor expansion:
On higher-order corrections for propagating uncertainties.
Wang CM & Iyer HK.
Metrologia (2005), 42: 406-410.

Accuracy of error propagation exemplified with ratios of random variables.
Winzer PJ.
Rev Sci Instrum (2000), 72: 1447-1454.

Matrix algebra for error propagation:
An Introduction to Error Propagation: Derivation, Meaning and Examples of Equation Cy = FxCxFx^t.
www.nada.kth.se/~kai-a/papers/arrasTR-9801-R3.pdf.

Second order nonlinear uncertainty modeling in strapdown integration using MEMS IMUs.
Zhang M, Hol JD, Slot L, Luinge H.
2011 Proceedings of the 14th International Conference on Information Fusion (FUSION) (2011).

Error propagation (in qPCR):
Error propagation in relative real-time reverse transcription polymerase chain reaction quantification models: The balance between accuracy and precision.
Nordgard O, Kvaloy JT, Farmen RK, Heikkila R.
Anal Biochem (2006), 356: 182-193.

qBase relative quantification framework and software for management and analysis of real-time quantitative PCR data.
Hellemans J, Mortier G, De Paepe A, Speleman F, Vandesompele J.
Genome Biol (2007), 8: R19.

Multivariate normal distribution:
Stochastic Simulation.
Ripley BD.
Stochastic Simulation (1987). Wiley. Page 98.

Testing for normal distribution:
Testing for Normality.
Thode Jr. HC.
Marcel Dekker (2002), New York.

Approximating the Shapiro-Wilk W-test for non-normality.
Royston P.
Stat Comp (1992), 2: 117-119.

Examples

## From summary data just calculate 
## Monte-Carlo and propagated error.
EXPR <- expression(x/y)
x <- c(5, 0.01)
y <- c(1, 0.01)
DF <- cbind(x, y)
RES1 <- propagate(expr = EXPR, data = DF, type = "stat", 
                 do.sim = TRUE, verbose = TRUE)

## Do Shapiro-Wilks test on Monte Carlo evaluations 
## !maximum 5000 datapoints can be used!
## => p.value on border to non-normality
shapiro.test(RES1$data.Sim[1:5000, 3])
## How about a graphical analysis:
qqnorm(RES1$data.Sim[, 3])

## Using raw data
## If data is of unequal length,
## use qpcR:::cbind.na to avoid replication!
## Do permutations (swap x and y values)
## and simulations.
EXPR <- expression(x*y)
x <- c(2, 2.1, 2.2, 2, 2.3, 2.1)
y <- c(4, 4, 3.8, 4.1, 3.1)
DF <- qpcR:::cbind.na(x, y)  
RES2 <- propagate(EXPR, DF, type = "raw", do.perm = TRUE, 
                 do.sim = TRUE, verbose = TRUE)
RES2$summary

## For replicate data, using relative 
## quantification ratio from qPCR.
## How good is the estimation of the propagated error?
## Done without using covariance in the 
## calculation and simulation.
## cp's and efficiencies are tied together
## because they are two observations on the
## same sample!
## As we are using an exponential type function,
## better to logarithmize the x-axis.
EXPR <- expression((E1^cp1)/(E2^cp2))
E1 <- c(1.73, 1.75, 1.77)
cp1 <- c(25.77,26.14,26.33)
E2 <-  c(1.72,1.68,1.65)
cp2 <- c(33.84,34.04,33.33)
DF <- cbind(E1, cp1, E2, cp2)
RES3 <- propagate(EXPR, DF, type = "raw", do.sim = TRUE,
                 do.perm = TRUE, verbose = TRUE, logx = TRUE)                 
## STRONG deviation from normality!
shapiro.test(RES3$data.Sim[1:5000, 5])
qqnorm(RES3$data.Sim[, 5])

## Same setup as above but also
## using a permutation approach
## for resampling the confidence interval.
## Cp's and efficiencies are tied together
## because they are two observations on the
## same sample! 
## Similar to what REST2008 software does.
RES4 <- propagate(EXPR, DF, type = "raw", do.sim = TRUE,
                 perm.crit = NULL, do.perm = TRUE, 
                 ties = c(1, 1, 2, 2), logx = TRUE, verbose = TRUE)
RES4$summary
## p-value of 0 in perm < init indicates that not a single 
## exchange of group memberships resulted in a smaller ratio!
              
## Proof that covariance of Monte-Carlo
## simulated dataset is the same as from 
## initial data.
RES4$covMat
cov(RES4$data.Sim[, 1:4])
all.equal(RES4$covMat, cov(RES4$data.Sim[, 1:4]))

[Package qpcR version 1.4-1 Index]