R: MRF cross validation and assessment of predictive performance

cv_MRF_diag {MRFcov}

R Documentation

MRF cross validation and assessment of predictive performance

Description

cv_MRF_diag runs cross validation of MRFcov models and tests predictive performance.

cv_MRF_diag_rep fits a single node-optimised model and test's this model's predictive performance across multiple test subsets of the data.

cv_MRF_diag_rep_spatial fits a single node-optimised spatial model and test's this model's predictive performance across multiple test subsets of the data.

All cv_MRF functions assess model predictive performance and produce either diagnostic plots or matrices of predictive metrics.

Usage

cv_MRF_diag(
  data,
  symmetrise,
  n_nodes,
  n_cores,
  sample_seed,
  n_folds,
  n_fold_runs,
  n_covariates,
  compare_null,
  family,
  plot = TRUE,
  cached_model,
  cached_predictions,
  mod_labels = NULL
)

cv_MRF_diag_rep(
  data,
  symmetrise,
  n_nodes,
  n_cores,
  sample_seed,
  n_folds,
  n_fold_runs,
  n_covariates,
  compare_null,
  family,
  plot = TRUE
)

cv_MRF_diag_rep_spatial(
  data,
  coords,
  symmetrise,
  n_nodes,
  n_cores,
  sample_seed,
  n_folds,
  n_fold_runs,
  n_covariates,
  compare_null,
  family,
  plot = TRUE
)

Arguments

`data`	Dataframe. The input data where the `n_nodes` left-most variables are variables that are to be represented by nodes in the graph. Note that `NA`'s are allowed for covariates. If present, these missing values will be imputed from the distribution `rnorm(mean = 0, sd = 1)`, which assumes that all covariates are scaled and centred (i.e. by using the function `scale` or similar)
`symmetrise`	The method to use for symmetrising corresponding parameter estimates (which are taken from separate regressions). Options are `min` (take the coefficient with the smallest absolute value), `max` (take the coefficient with the largest absolute value) or `mean` (take the mean of the two coefficients). Default is `mean`
`n_nodes`	Positive integer. The index of the last column in `data` which is represented by a node in the final graph. Columns with index greater than n_nodes are taken as covariates. Default is the number of columns in `data`, corresponding to no additional covariates
`n_cores`	Positive integer. The number of cores to spread the job across using `makePSOCKcluster`. Default is 1 (no parallelisation)
`sample_seed`	Numeric. This seed will be used as the basis for dividing data into folds. Default is a random seed between 1 and 100000
`n_folds`	Integer. The number of folds for cross-validation. Default is 10
`n_fold_runs`	Integer. The number of total training runs to perform. During each run, the data will be split into `n_folds` folds and the observed data in each fold will be compared to their respective predictions. Defaults to `n_folds`
`n_covariates`	Positive integer. The number of covariates in `data`, before cross-multiplication
`compare_null`	Logical. If `TRUE`, null models will also be run and plotted to assess the influence of including covariates on model predictive performance. Default is `FALSE`
`family`	The response type. Responses can be quantitative continuous (`family = "gaussian"`), non-negative counts (`family = "poisson"`) or binomial 1s and 0s (`family = "binomial"`).
`plot`	Logical. If `TRUE`, `ggplot2` objects are returned. If `FALSE`, the prediction metrics are returned as a matrix. Default is `TRUE`
`cached_model`	Used by function `cv_MRF_diag_rep` to store an optimised model and prevent unneccessary replication of node-optimised model fitting
`cached_predictions`	Used by function `cv_MRF_diag_rep` to store predictions from optimised models and prevent unneccessary replication
`mod_labels`	Optional character string of labels for the two models being compared (if `compare_null == TRUE`)
`coords`	A two-column `dataframe` (with `nrow(coords) == nrow(data)`) representing the spatial coordinates of each observation in `data`. Ideally, these coordinates will represent Latitude and Longitude GPS points for each observation.

Details

Node-optimised models are fitted using cv.glmnet, and these models is used to predict data test subsets. Test and training data subsets are created using createFolds.

To account for uncertainty in parameter estimates and in random fold generation, it is recommended to perform cross-validation multiple times (by controlling the n_fold_runs argument) using cv_MRF_diag_rep to supply a single cached model and that model's predictions. This is useful for optimising a single model (using cv.glmnet) and testing this model's predictive performance across many test subsets. Alternatively, one can run cv_MRF_diag many times to fit different models in each iteration. This will be slower but technically more sound

Value

If plot = TRUE, a ggplot2 object is returned. This will be a plot containing boxplots of predictive metrics across test sets using the optimised model (see cv.glmnet for further details of lambda1 optimisation). If plot = FALSE, a matrix of prediction metrics is returned.

References

Clark, NJ, Wells, K and Lindberg, O. Unravelling changing interspecific interactions across environmental gradients using Markov random fields. (2018). Ecology doi: 10.1002/ecy.2221 Full text here.

Examples


data("Bird.parasites")
# Generate boxplots of model predictive metrics
cv_MRF_diag(data = Bird.parasites, n_nodes = 4,
           n_cores = 1, family = 'binomial')

# Generate boxplots comparing the CRF to an MRF model (no covariates)
cv_MRF_diag(data = Bird.parasites, n_nodes = 4,
           n_cores = 1, family = 'binomial',
           compare_null = TRUE)

# Replicate 10-fold cross-validation 10 times
cv.preds <- cv_MRF_diag_rep(data = Bird.parasites, n_nodes = 4,
                           n_cores = 1, family = 'binomial',
                           compare_null = TRUE,
                           plot = FALSE, n_fold_runs = 10)

# Plot model sensitivity and % true predictions
library(ggplot2)
gridExtra::grid.arrange(
 ggplot(data = cv.preds, aes(y = mean_sensitivity, x = model)) +
       geom_boxplot() + theme(axis.text.x = ggplot2::element_blank()) +
       labs(x = ''),
 ggplot(data = cv.preds, aes(y = mean_tot_pred, x = model)) +
       geom_boxplot(),
       ncol = 1,
 heights = c(1, 1))

# Create some sample Poisson data with strong correlations
cov <- rnorm(500, 0.2)
cov2 <- rnorm(500, 1)
sp.2 <- rpois(500, lambda = exp(1.5 + (cov * 0.9)))
poiss.dat <- data.frame(sp.1 = rpois(500, lambda = exp(0.5 + (cov * 0.3))),
                       sp.2 = sp.2,
                       sp.3 = rpois(500, lambda = exp(log(sp.2 + 1) + (cov * -0.5))),
                       cov = cov,
                       cov2 = cov2)

# A CRF should produce a better fit (lower deviance, lower MSE)
cvMRF.poiss <- cv_MRF_diag(data = poiss.dat, n_nodes = 3,
                          n_folds = 10,
                          family = 'poisson',
                          compare_null = TRUE, plot = TRUE)

[Package MRFcov version 1.0.39 Index]