statsModelSeries {pedometrics} | R Documentation |
Obtain performance statistics of a series of linear models
Description
Compute several statistics measuring the performance of a series of linear models built using buildModelSeries(), with an option to rank the models based on one of the returned performance statistics.
Usage
statsModelSeries(model, design.info, arrange.by, digits)
statsMS(model, design.info, arrange.by, digits)
Arguments
model |
A list of linear models returned by buildModelSeries(). |
design.info |
Extra information about the linear models in the series. |
arrange.by |
Character string defining whether the table with the performance statistics of the linear models should be arranged and, if so, which column should be used. Available options are
|
digits |
Integer or vector with six integers indicating the number of decimal places to be used to round the performance statistics. If a vector is passed to the function, the number of decimal places should be in the following order:
|
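As an illustration of what arrange.by and a vector of digits do to a table of statistics, here is a minimal base R sketch. The toy table, its values, the ascending sort order, and the named digits vector are all assumptions made for this example; they are not taken from statsModelSeries() itself, which may order and round differently.

```r
# Toy table of statistics (made-up values, for illustration only)
tab <- data.frame(id = 1:3,
                  aic = c(-120.4567, -131.2345, -98.7654),
                  r2 = c(0.81234, 0.85678, 0.74321))

# Arrange the rows by the chosen column (ascending order assumed here)
arrange.by <- "aic"
tab <- tab[order(tab[[arrange.by]]), ]

# Round each statistic with its own number of decimal places
digits <- c(aic = 1, r2 = 3)
tab[names(digits)] <- Map(round, tab[names(digits)], digits)
tab
```

Passing a single integer instead of a vector would correspond to rounding every statistic to the same number of decimal places.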
Details
This function was devised to deal with a list of linear models generated by the function buildModelSeries(). The main objective is to compare several linear models using several performance statistics. Such statistics can then be used to rank the linear models and identify, for example, the best performing model, given the selected performance statistics.
An important feature of buildModelSeries() is that it uses the information about the initial number of candidate predictor variables offered to build the model to calculate penalized or adjusted measures of model performance. Such information is recorded as an attribute of the final model selected by buildModelSeries(). This feature was included in statsModelSeries() because data-driven variable selection results in biased (overly optimistic) linear models, and the effective number of degrees of freedom is close to the number of candidate predictor variables initially offered to the model (Harrell, 2001).
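The effect of penalizing by the number of candidate predictors can be sketched in base R. The helper adj_r2_candidates() is hypothetical, and the formula is the usual adjusted R-squared with the number of retained predictors replaced by the number of candidates p; the exact formula used by statsModelSeries() may differ.

```r
# Hypothetical helper: adjusted R-squared penalized by the number of
# candidate predictors (p) instead of the number of predictors retained.
adj_r2_candidates <- function(fit, p) {
  n <- length(residuals(fit))
  r2 <- summary(fit)$r.squared
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Built-in data: the model keeps 2 predictors, but 10 were available
fit <- lm(mpg ~ wt + hp, data = mtcars)
p <- ncol(mtcars) - 1

# The candidate-based value is lower (less optimistic) than the usual one
c(adj_r2 = summary(fit)$adj.r.squared,
  ADJ_r2 = adj_r2_candidates(fit, p))
```

When p equals the number of predictors actually in the model, the helper reproduces the adjusted R-squared reported by summary().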
Value
A data frame with several performance statistics:
- id
Identification of the model.
- candidates
Number of candidate predictor variables initially offered to the model.
- df
Number of degrees of freedom of the final selected model.
- aic
Akaike's Information Criterion (AIC). Obtained using extractAIC.
- rmse
Root mean squared error (RMSE), calculated based on the number of candidate predictor variables initially offered to the model.
- nrmse
Normalized root mean squared error, calculated as the ratio between the RMSE and the standard deviation of the observed values of the dependent variable.
- r2
Multiple coefficient of determination.
- adj_r2
Adjusted multiple coefficient of determination.
- ADJ_r2
Adjusted multiple coefficient of determination. Calculations are done based on the number of candidate predictor variables initially offered to the model.
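For a single fitted model, several of the columns listed above can be approximated with base R. This is only a sketch: model_stats() is a hypothetical helper, the residual degrees of freedom n - p - 1 used for the RMSE is an assumption about how the candidate-based penalty is applied, and taking df from the first element of extractAIC() is likewise an assumption.

```r
# Hypothetical helper: one row of statistics for a single lm fit, given
# the number of candidate predictors p (formulas are assumptions).
model_stats <- function(fit, p) {
  y <- model.response(model.frame(fit))
  n <- length(y)
  aic <- extractAIC(fit)  # c(equivalent degrees of freedom, AIC)
  rmse <- sqrt(sum(residuals(fit)^2) / (n - p - 1))
  data.frame(
    candidates = p,
    df = aic[1],
    aic = aic[2],
    rmse = rmse,
    nrmse = rmse / sd(y),
    r2 = summary(fit)$r.squared
  )
}

fit <- lm(mpg ~ wt + hp, data = mtcars)
model_stats(fit, p = 10)
```

An NRMSE below 1 indicates that the model error is smaller than the spread of the observed values of the dependent variable.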
TODO
Include other performance statistics, such as PRESS, BIC, Mallows' Cp, and max(VIF);
Add an option to select which performance statistics should be returned.
Author(s)
Alessandro Samuel-Rosa alessandrosamuelrosa@gmail.com
References
Harrell, F. E. (2001) Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. First edition. New York: Springer.
Venables, W. N. and Ripley, B. D. (2002) Modern applied statistics with S. Fourth edition. New York: Springer.
Samuel-Rosa, A., Heuvelink, G. B. M., de Mattos Vasques, G. and dos Anjos, L. H. C. (2015) Do more detailed environmental covariates deliver more accurate soil maps? Geoderma, 243–244, 214–227. doi: 10.1016/j.geoderma.2014.12.017.
See Also
buildModelSeries(), plotModelSeries()
Examples
if (interactive()) {
  # Based on the second example of function MASS::stepAIC()
  library(MASS)
  # Bin the six numeric predictors into quantile-based classes
  cpus1 <- cpus
  for (v in names(cpus)[2:7]) {
    cpus1[[v]] <- cut(cpus[[v]], unique(quantile(cpus[[v]])),
                      include.lowest = TRUE)
  }
  cpus0 <- cpus1[, 2:8] # excludes names, authors' predictions
  # Random sample of 100 of the 209 observations
  cpus.samp <- sample(1:209, 100)
  # Series of three model formulas
  cpus.form <- list(
    formula(log10(perf) ~ syct + mmin + mmax + cach + chmin + chmax + perf),
    formula(log10(perf) ~ syct + mmin + cach + chmin + chmax),
    formula(log10(perf) ~ mmax + cach + chmin + chmax + perf))
  data <- cpus0[cpus.samp, ]
  # Build the series of models, then compute and rank their statistics
  cpus.ms <- buildModelSeries(cpus.form, data, vif = TRUE, aic = TRUE)
  cpus.des <- data.frame(a = c(0, 1, 0), b = c(1, 0, 1), c = c(1, 1, 0))
  stats <- statsModelSeries(cpus.ms, design.info = cpus.des, arrange.by = "aic")
}