cvq2-package {cvq2} | R Documentation |

This package compares observations with their predictions calculated by a model `M`.
It calculates the predictive squared correlation coefficient, *q^2*, in comparison to the well-known conventional squared correlation coefficient, *r^2*.

| Field | Value |
|---|---|
| Package: | cvq2 |
| Type: | Package |
| Version: | 1.2.0 |
| Date: | 2013-10-10 |
| Depends: | methods, stats |
| License: | GPL v3 |
| LazyLoad: | yes |

This package needs either a description of parameters and observations (I) or a data set that already contains the observations and their related predictions (II).
In case (I), a linear model `M` is generated on the fly.
Afterwards, its calibration performance can be compared with its prediction power.
If the input data consist of observations and predictions only (II), the package can be used to compute either the calibration performance or the prediction power.
If model `M` is generated on the fly (I), the procedure is as follows:
The input data set consists of parameters *x_1, x_2, …, x_n* which describe an observation `y`.
A linear regression (`glm`) of this data set yields `M`.
Thus the conventional squared correlation coefficient, *r^2*, can be calculated:

$$r^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i^{fit} - y_i \right)^2}{\sum_{i=1}^{N} \left( y_i - y_{mean} \right)^2} \equiv 1 - \frac{RSS}{SS}$$

The numerator is the **R**esidual **S**um of **S**quares, *RSS*: the sum of squared differences between the fitted values *y_i^fit* predicted by `M` and the observations *y_i*.
The denominator is the **S**um of **S**quares, *SS*: the sum of squared differences between the observations *y_i* and their mean *y_mean*.
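The calculation of *r^2* as 1 - RSS/SS can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: the data frame `d` and its values are made up, and only `glm()`, `fitted()` and `mean()` from base R are used.

```r
# Toy data (made up for illustration): y is described by two parameters x1 and x2.
d <- data.frame(
  x1 = c(1, 2, 3, 4, 5, 6),
  x2 = c(2, 1, 4, 3, 6, 5),
  y  = c(3.1, 3.9, 7.2, 7.8, 11.1, 11.9)
)

# Linear regression; glm() with its default gaussian family is ordinary least squares.
M <- glm(y ~ x1 + x2, data = d)

y_fit <- fitted(M)                   # fitted values y_i^fit
rss   <- sum((y_fit - d$y)^2)        # residual sum of squares, RSS
ss    <- sum((d$y - mean(d$y))^2)    # sum of squares around the mean, SS
r2    <- 1 - rss / ss
r2
```

For an ordinary least-squares fit with an intercept, this value should agree with `summary(lm(y ~ x1 + x2, data = d))$r.squared`.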

To compare the calibration of `M` with its prediction power, `M` is applied to an external data set.
The data set is called external because its data have not been used in the linear regression that generated `M`.
Comparing the predictions *y_i^pred* with the observations *y_i* yields the predictive squared correlation coefficient, *q^2*:

$$q^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i^{pred} - y_i \right)^2}{\sum_{i=1}^{N} \left( y_i - y_{mean} \right)^2} \equiv 1 - \frac{PRESS}{SS}$$

The **PRE**dictive residual **S**um of **S**quares, *PRESS*, is the sum of squared differences between the predictions *y_i^pred* and the observations *y_i*.
The **S**um of **S**quares, *SS*, is the sum of squared differences between the observations *y_i* and their mean *y_mean*.
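The external validation described above might look like this in base R. The sketch rests on two assumptions that are not taken from the package: the data sets `train` and `ext` are invented, and *y_mean* is taken as the mean of the external observations. Schüürmann et al. 2008 discuss the training set mean as an alternative, so the choice of mean should be checked against the function actually used.

```r
# Made-up calibration data used to generate M.
train <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.9))

# Made-up external data; these rows take no part in the regression.
ext <- data.frame(x = c(1.5, 3.5, 5.5),
                  y = c(3.2, 6.8, 11.3))

M      <- glm(y ~ x, data = train)
y_pred <- predict(M, newdata = ext)      # predictions y_i^pred for the external set

press <- sum((y_pred - ext$y)^2)         # PRESS
# Assumption: y_mean is the mean of the external observations.
ss    <- sum((ext$y - mean(ext$y))^2)    # SS
q2    <- 1 - press / ss
q2
```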

If no external data set is available, one can perform a cross-validation to evaluate the prediction performance.
The cross-validation splits the model data set (*N* elements) into a training set (*N-k* elements) and a test set (*k* elements).
Each training set yields an individual model `M'`, which is used to predict the *k* left-out value(s).
Each model `M'` is slightly different from `M`.
In this way, every observed value *y_i* is predicted once, and the comparison between the observation and the prediction (*y_i^{pred(N-k)}*) yields *q^2_cv*:

$$q_{cv}^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_i^{pred(N-k)} - y_i \right)^2}{\sum_{i=1}^{N} \left( y_i - y_{mean}^{N-k,i} \right)^2}$$

The arithmetic mean used in this equation, *y_mean^N-k,i*, is specific to each test set and is calculated from the observed values contained in the corresponding training set.
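A leave-one-out cross-validation (*k = 1*) with the training set mean can be sketched in base R as shown here. The data are made up and the loop is an illustration of the formula, not the implementation used by the package.

```r
# Made-up model data set with N = 8 elements.
d <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8),
                y = c(2.3, 3.8, 6.1, 8.2, 9.7, 12.4, 13.8, 16.1))
N <- nrow(d)

y_pred     <- numeric(N)   # y_i^pred(N-k), one prediction per observation
y_mean_trn <- numeric(N)   # y_mean^(N-k,i), mean of the matching training set

for (i in seq_len(N)) {
  train <- d[-i, ]                      # training set: N - 1 elements
  M_i   <- glm(y ~ x, data = train)     # individual model M'
  y_pred[i]     <- predict(M_i, newdata = d[i, , drop = FALSE])
  y_mean_trn[i] <- mean(train$y)
}

press <- sum((y_pred - d$y)^2)
ss    <- sum((d$y - y_mean_trn)^2)
q2_cv <- 1 - press / ss
q2_cv
```

Because each mean excludes the observation it is compared with, the denominator is larger than the ordinary *SS* around the overall mean; for leave-one-out it is larger by the factor (N/(N-1))^2.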

If *k>1*, the composition of the training and test sets may affect the calculated predictive squared correlation coefficient.
To reduce this bias, one can repeat the calculation with various compositions of training and test sets.
Thus, every observed value is predicted several times, according to the number of runs performed.
Note that if the prediction performance is evaluated with cross-validation, the calculation of the predictive squared correlation coefficient, *q^2*, is more accurate than the calculation of the conventional squared correlation coefficient, *r^2*.
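Repeating a *k*-fold cross-validation with different random splits could be sketched as below. The helper `q2_one_run()` is hypothetical, the data are made up, and pooling *PRESS* and *SS* over all runs is an assumption about how repeated predictions are combined; the package may aggregate them differently.

```r
set.seed(1)  # only to make the sketch reproducible

# Made-up model data set with N = 12 elements.
d <- data.frame(x = 1:12,
                y = c(2.2, 4.1, 5.8, 8.3, 9.9, 12.2,
                      13.7, 16.4, 18.1, 19.8, 22.3, 23.9))

# Hypothetical helper: one cross-validation run with nFold randomly composed folds.
# Returns the PRESS and SS contributions of that run.
q2_one_run <- function(d, nFold) {
  N    <- nrow(d)
  fold <- sample(rep(seq_len(nFold), length.out = N))  # random fold assignment
  press <- 0
  ss    <- 0
  for (f in seq_len(nFold)) {
    test  <- d[fold == f, ]
    train <- d[fold != f, ]
    M_f    <- glm(y ~ x, data = train)
    y_pred <- predict(M_f, newdata = test)
    press  <- press + sum((y_pred - test$y)^2)
    ss     <- ss + sum((test$y - mean(train$y))^2)  # training set mean
  }
  c(press = press, ss = ss)
}

n_run <- 5
runs  <- replicate(n_run, q2_one_run(d, nFold = 3))  # 2 x n_run matrix

# Assumption: pool the contributions of all runs before forming the ratio.
q2_cv <- 1 - sum(runs["press", ]) / sum(runs["ss", ])
q2_cv
```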

In addition to *r^2* and *q^2* the root-mean-square-error `rmse` is calculated to measure the accuracy of model `M`:

$$rmse = \sqrt{\frac{\sum\limits_{i=1}^{N}\left( y_i^{pred} - y_i \right)^2}{N-\nu}}$$

The `rmse` measures the difference between a model's predictions (*y_i^{pred}*) and the actual observations (*y_i*) and can be applied to both calibration and prediction power.
It depends on the number of observations `N` and the method used to generate the model `M`.
The `rmse` tends to overestimate the accuracy of `M`.
Following Friedrich Bessel's suggestion [Upton and Cook 2008], this overestimation can be corrected by taking the degrees of freedom, *ν*, into account.
Thus, in the case of cross-validation, *ν=1* is recommended when calculating the `rmse` for the prediction power.
The degrees of freedom, *ν*, used to calculate the `rmse` for the prediction power can be set as a parameter of `cvq2()`, `looq2()` and `q2()`.
In contrast, *ν=0* is fixed when calculating the `rmse` for the model calibration.
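The effect of *ν* on the `rmse` can be illustrated with a small base R function. The function name `rmse()`, its argument `nu` and the numbers are chosen for this sketch only; they are not the package's interface.

```r
# Illustrative helper: root-mean-square error with nu degrees of freedom removed.
rmse <- function(y_pred, y_obs, nu = 0) {
  stopifnot(length(y_pred) == length(y_obs), length(y_obs) > nu)
  sqrt(sum((y_pred - y_obs)^2) / (length(y_obs) - nu))
}

# Made-up observations and predictions; every squared difference is 0.25.
y_obs  <- c(2.0, 4.0, 6.0, 8.0, 10.0)
y_pred <- c(2.5, 3.5, 6.5, 7.5, 10.5)

rmse_calib <- rmse(y_pred, y_obs, nu = 0)  # sqrt(1.25 / 5) = 0.5
rmse_pred  <- rmse(y_pred, y_obs, nu = 1)  # sqrt(1.25 / 4), slightly larger
```

With *ν=1* the sum of squared differences is divided by *N-1* instead of *N*, so the reported error is larger, which counteracts the overestimation described above.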

If the input consists only of observed and predicted values (II), *r^2* or *q^2*, respectively, and the corresponding `rmse` are calculated directly from these data. No model `M` is generated and no cross-validation is applied.

Development of this package started a few years ago in the Ecological Chemistry Department during my time at the Helmholtz Centre for Environmental Research in Leipzig. It is based on Schüürmann et al. 2008: External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean.

Torsten Thalheim <torstenthalheim@gmx.de>

Cramer RD III. 1980. BC(DEF) Parameters. 2. An Empirical Structure-Based Scheme for the Prediction of Some Physical Properties. *J. Am. Chem. Soc.* **102:** 1849-1859.

Cramer RD III, Bunce JD, Patterson DE, Frank IE. 1988. Crossvalidation, Bootstrapping, and Partial Least Squares Compared with Multiple Linear Regression in Conventional QSAR Studies. *Quant. Struct.-Act. Relat.* **1988:** 18-25.

Organisation for Economic Co-operation and Development. 2007. Guidance document on the validation of (quantitative) structure-activity relationship [(Q)SAR] models. *OECD Series on Testing and Assessment 69.* OECD Document ENV/JM/MONO(2007)2, pp 55 (paragraph no. 198) and 65 (Table 5.7).

Schüürmann G, Ebert R-U, Chen J, Wang B, Kühne R. 2008. External validation and prediction employing the predictive squared correlation coefficient - test set activity mean vs training set activity mean. *J. Chem. Inf. Model.* **48:** 2140-2145.

Tropsha A, Gramatica P, Gombar VK. 2003. The Importance of Being Earnest: Validation is the Absolute Essential for Successful Application and Interpretation of QSPR Models. *QSAR Comb. Sci.* **22:** 69-77.

Upton G, Cook I. 2008. Oxford Dictionary of Statistics. *Oxford University Press* **ISBN 978-0-19-954145-4** entry for "Variance (data)".

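    library(cvq2)

    # (I) model generated on the fly, cross-validation with default settings
    data(cvq2.sample.A)
    result <- cvq2( cvq2.sample.A, y ~ x1 + x2 )
    result

    # (I) cross-validation with 3 folds
    data(cvq2.sample.B)
    result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3 )
    result

    # (I) cross-validation with 3 folds, repeated in 5 runs
    data(cvq2.sample.B)
    result <- cvq2( cvq2.sample.B, y ~ x, nFold = 3, nRun = 5 )
    result

    # (I) model from cvq2.sample.A, applied to the separate data set cvq2.sample.A_pred
    data(cvq2.sample.A)
    data(cvq2.sample.A_pred)
    result <- q2( cvq2.sample.A, cvq2.sample.A_pred, y ~ x1 + x2 )
    result

    # (II) calibration performance from observations and predictions only
    data(cvq2.sample.C)
    result <- calibPow( cvq2.sample.C )
    result

    # (II) prediction power from observations and predictions only
    data(cvq2.sample.D)
    result <- predPow( cvq2.sample.D, obs_mean="observed_mean" )
    result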

[Package *cvq2* version 1.2.0 Index]