compare_methods {dendroTools}R Documentation

compare_methods

Description

Calculates performance metrics for calibration (train) and validation (test) data of different regression methods: multiple linear regression (MLR), artificial neural networks with Bayesian regularization training algorithm (BRNN), (ensemble of) model trees (MT) and random forest of regression trees (RF). With the subset argument, specific methods of interest could be specified. Calculated performance metrics are the correlation coefficient (r), the root mean squared error (RMSE), the root relative squared error (RRSE), the index of agreement (d), the reduction of error (RE), the coefficient of efficiency (CE), the detrended efficiency (DE) and mean bias. For each of the considered methods, there are also residual diagnostic plots available, separately for calibration, holdout and edge data, if applicable.

Usage

compare_methods(
  formula,
  dataset,
  k = 10,
  repeats = 2,
  optimize = TRUE,
  dataset_complete = NULL,
  BRNN_neurons = 1,
  MT_committees = 1,
  MT_neighbors = 5,
  MT_rules = 200,
  MT_unbiased = TRUE,
  MT_extrapolation = 100,
  MT_sample = 0,
  RF_ntree = 500,
  RF_maxnodes = 5,
  RF_mtry = 1,
  RF_nodesize = 1,
  seed_factor = 5,
  digits = 3,
  blocked_CV = FALSE,
  PCA_transformation = FALSE,
  log_preprocess = TRUE,
  components_selection = "automatic",
  eigenvalues_threshold = 1,
  N_components = 2,
  round_bias_cal = 15,
  round_bias_val = 4,
  n_bins = 30,
  edge_share = 0.1,
  MLR_stepwise = FALSE,
  stepwise_direction = "backward",
  methods = c("MLR", "BRNN", "MT", "RF"),
  tuning_metric = "RMSE",
  BRNN_neurons_vector = c(1, 2, 3),
  MT_committees_vector = c(1, 5, 10),
  MT_neighbors_vector = c(0, 5),
  MT_rules_vector = c(100, 200),
  MT_unbiased_vector = c(TRUE, FALSE),
  MT_extrapolation_vector = c(100),
  MT_sample_vector = c(0),
  RF_ntree_vector = c(100, 250, 500),
  RF_maxnodes_vector = c(5, 10, 20, 25),
  RF_mtry_vector = c(1),
  RF_nodesize_vector = c(1, 5, 10),
  holdout = NULL,
  holdout_share = 0.1,
  holdout_manual = NULL,
  total_reproducibility = FALSE
)

Arguments

formula

an object of class "formula" (or one that can be coerced to that class): a symbolic description of the model to be fitted.

dataset

a data frame with dependent and independent variables as columns and (optional) years as row names.

k

number of folds for cross-validation

repeats

number of cross-validation repeats. Should be equal or more than 1

optimize

if set to TRUE (default), the optimal values for the tuning parameters will be selected in a preliminary cross-validation procedure

dataset_complete

optional, a data frame with the full length of tree-ring parameter, which will be used to reconstruct the climate variable specified with the formula argument

BRNN_neurons

number of neurons to be used for the brnn method

MT_committees

an integer: how many committee models (e.g. boosting iterations) should be used?

MT_neighbors

how many, if any, neighbors should be used to correct the model predictions

MT_rules

an integer (or NA): define an explicit limit to the number of rules used (NA let’s Cubist decide).

MT_unbiased

a logical: should unbiased rules be used?

MT_extrapolation

a number between 0 and 100: since Cubist uses linear models, predictions can be outside of the outside of the range seen the training set. This parameter controls how much rule predictions are adjusted to be consistent with the training set.

MT_sample

a number between 0 and 99.9: this is the percentage of the dataset to be randomly selected for model building (not for out-of-bag type evaluation)

RF_ntree

number of trees to grow. This should not be set to too small a number, to ensure that every input row gets predicted at least a few times

RF_maxnodes

maximum number of terminal nodes trees in the forest can have

RF_mtry

number of variables randomly sampled as candidates at each split

RF_nodesize

minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time).

seed_factor

an integer that will be used to change the seed options for different repeats.

digits

integer of number of digits to be displayed in the final result tables

blocked_CV

default is FALSE, if changed to TRUE, blocked cross-validation will be used to compare regression methods.

PCA_transformation

if set to TRUE, all independent variables will be transformed using PCA transformation.

log_preprocess

if set to TRUE, variables will be transformed with logarithmic transformation before used in PCA

components_selection

character string specifying how to select the Principal Components used as predictors. There are three options: "automatic", "manual" and "plot_selection". If parameter is set to automatic, all scores with eigenvalues above 1 will be selected. This threshold could be changed by changing the eigenvalues_threshold argument. If parameter is set to "manual", user should set the number of components with N_components argument. If component selection is se to "plot_selection", Scree plot will be shown and user must manually enter the number of components used as predictors.

eigenvalues_threshold

threshold for automatic selection of Principal Components

N_components

number of Principal Components used as predictors

round_bias_cal

number of digits for bias in calibration period. Effects the outlook of the final ggplot of mean bias for calibration data (element 3 of the output list)

round_bias_val

number of digits for bias in validation period. Effects the outlook of the final ggplot of mean bias for validation data (element 4 of the output list)

n_bins

number of bins used for the histograms of mean bias

edge_share

the share of the data to be considered as the edge (extreme) data. This argument could be between 0.10 and 0.50. If the argument is set to 0.10, then the 5 considered to be the edge data.

MLR_stepwise

if set to TRUE, stepwise selection of predictors will be used for the MLR method

stepwise_direction

the mode of stepwise search, can be one of "both", "backward", or "forward", with a default of "backward".

methods

a vector of strings related to methods that will be compared. A full method vector is methods = c("MLR", "BRNN", "MT", "RF"). To use only a subset of methods, pass a vector of methods that you would like to compare.

tuning_metric

a string that specifies what summary metric will be used to select the optimal value of tuning parameters. By default, the argument is set to "RMSE". It is also possible to use "RSquared".

BRNN_neurons_vector

a vector of possible values for BRNN_neurons argument optimization

MT_committees_vector

a vector of possible values for MT_committees argument optimization

MT_neighbors_vector

a vector of possible values for MT_neighbors argument optimization

MT_rules_vector

a vector of possible values for MT_rules argument optimization

MT_unbiased_vector

a vector of possible values for MT_unbiased argument optimization

MT_extrapolation_vector

a vector of possible values for MT_extrapolation argument optimization

MT_sample_vector

a vector of possible values for MT_sample argument optimization

RF_ntree_vector

a vector of possible values for RF_ntree argument optimization

RF_maxnodes_vector

a vector of possible values for RF_maxnodes argument optimization

RF_mtry_vector

a vector of possible values for RF_mtry argument optimization

RF_nodesize_vector

a vector of possible values for RF_nodesize argument optimization

holdout

this argument is used to define observations, which are excluded from the cross-validation and hyperparameters optimization. The holdout argument must be a character with one of the following inputs: “early”, “late” or “manual”. If "early" or "late" characters are specified, then the early or late years will be used as a holdout data. How many of the "early" or "late" years are used as a holdout is specified with the argument holdout_share. If the argument holdout is set to “manual”, then supply a vector of years (or row names) to the argument holdout_manual. Defined years will be used as a holdout. For the holdout data, the same statistical measures are calculated as for the cross-validation. The results for holdout metrics are given in the output element $holdout_results.

holdout_share

the share of the whole dataset to be used as a holdout. Default is 0.10.

holdout_manual

a vector of years (or row names) which will be used as a holdout. calculated as for the cross-validation.

total_reproducibility

logical, default is FALSE. This argument ensures total reproducibility despite the inclusion/exclusion of different methods. By default, the optimization is done only for the methods, that are included in the methods vector. If one method is absent or added, the optimization phase is different, and this affects all the final cross-validation results. By setting the total_reproducibility = TRUE, all methods will be optimized, even though they are not included in the methods vector and the final results will be subset based on the methods vector. Setting the total_reproducibility to TRUE will result in longer optimization phase as well.

Value

a list with 19 elements:

  1. $mean_std - data frame with calculated metrics for the selected \ regression methods. For each regression method and each calculated metric, mean and standard deviation are given

  2. $ranks - data frame with ranks of calculated metrics: mean rank and share of rank_1 are given

  3. $edge_results - data frame with calculated performance metrics for the central-edge test. The central part of the data represents the calibration data, while the edge data, i.e. extreme values, represent the test/validation data. Different regression models are calibrated using the central data and validated for the edge (extreme) data. This test is particularly important to assess the performance of models for the predictions of the extreme data. The share of the edge (extreme) data is defined with the edge_share argument

  4. $holdout_results - calculated metrics for the holdout data

  5. $bias_cal - ggplot object of mean bias for calibration data

  6. $bias_val - ggplot object of mean bias for validation data

  7. $transfer_functions - ggplot or plotly object with transfer functions of methods

  8. $transfer_functions_together - ggplot or plotly object with transfer functions of methods plotted together

  9. $parameter_values - a data frame with specifications of parameters used for different regression methods

  10. $PCA_output - princomp object: the result output of the PCA analysis

  11. $reconstructions - ggplot object: reconstructed dependent variable based on the dataset_complete argument, facet is used to split plots by methods

  12. $reconstructions_together - ggplot object: reconstructed dependent variable based on the dataset_complete argument, all reconstructions are on the same plot

  13. $normal_QQ_cal - normal q-q plot for calibration data

  14. $normal_QQ_holdout - normal q-q plot for holdout data

  15. $normal_QQ_edge- normal q-q plot for edge data

  16. $residuals_vs_fitted_cal - residuals vs fitted values plot for calibration data

  17. $residuals_vs_fitted_holdout - residuals vs fitted values plot for holdout data

  18. $residuals_vs_fitted_edge - residuals vs fitted values plot for edge data

  19. $reconstructions_data - raw data that is used for creating reconstruction plots

References

Bishop, C.M., 1995. Neural Networks for Pattern Recognition. Oxford University Press, Inc. 482 pp.

Breiman, L., 1996. Bagging predictors. Machine Learning 24, 123-140.

Breiman, L., 2001. Random forests. Machine Learning 45, 5-32.

Burden, F., Winkler, D., 2008. Bayesian Regularization of Neural Networks, in: Livingstone, D.J. (ed.), Artificial Neural Networks: Methods and Applications, vol. 458. Humana Press, Totowa, NJ, pp. 23-42.

Hastie, T., Tibshirani, R., Friedman, J.H., 2009. The Elements of Statistical Learning : Data Mining, Inference, and Prediction, 2nd ed. Springer, New York xxii, 745 p. pp.

Ho, T.K., 1995. Random decision forests, Proceedings of the Third International Conference on Document Analysis and Recognition Volume 1. IEEE Computer Society, pp. 278-282.

Hornik, K., Buchta, C., Zeileis, A., 2009. Open-source machine learning: R meets Weka. Comput. Stat. 24, 225-232.

Perez-Rodriguez, P., Gianola, D., 2016. Brnn: Brnn (Bayesian Regularization for Feed-forward Neural Networks). R package version 0.6.

Quinlan, J.R., 1992. Learning with Continuous Classes, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence (AI '92). World Scientific, Hobart, pp. 343-348.

Examples



# The examples below are enclosed within donttest{} to minimize the execution
# time during R package checks. #'

# An example with default settings of machine learning algorithms
library(dendroTools)
library(ggplot2)

data(example_dataset_1)
data(dataset_TRW)

example_1 <- compare_methods(formula = MVA ~  T_APR,
dataset = example_dataset_1, k = 5, repeats = 1, BRNN_neurons = 1,
RF_ntree = 100, RF_mtry = 2, RF_maxnodes = 35, seed_factor = 5)

# example_1$mean_std
# example_1$ranks
# example_1$bias_cal
# example_1$transfer_functions
# example_1$transfer_functions_together
# example_1$PCA_output
# example_1$parameter_values

example_2 <- compare_methods(formula = MVA ~ .,
dataset = example_dataset_1, k = 2, repeats = 2,
methods = c("MLR", "BRNN", "MT"),
optimize = TRUE, MLR_stepwise = TRUE)
# example_2$mean_std
# example_2$ranks
# example_2$bias_val
# example_2$transfer_functions
# example_2$transfer_functions_together
# example_2$parameter_values

comparison_TRW <- compare_methods(formula = T_Jun_Jul ~ TRW, dataset = dataset_TRW,
k = 3, repeats = 5, optimize = FALSE, methods = c("MLR", "BRNN", "RF", "MT"),
seed_factor = 5, dataset_complete = dataset_TRW_complete, MLR_stepwise = TRUE,
stepwise_direction = "backward")

# comparison_TRW$mean_std
# comparison_TRW$bias_val
# comparison_TRW$transfer_functions
# comparison_TRW$reconstructions
# comparison_TRW$reconstructions_together
# comparison_TRW$edge_results
# comparison_TRW$reconstructions_data



[Package dendroTools version 1.2.13 Index]