model.hsstan {nestedcv}
hsstan model for cross-validation
Description
This function applies a cross-validation (CV) procedure for training Bayesian models with hierarchical shrinkage priors using the hsstan package. The function allows optional embedded filtering of predictors for feature selection within the CV loop: within each training fold, an optional filtering of predictors is performed, followed by fitting of an hsstan model. Predictions on the testing folds are then pooled and error/accuracy estimates determined. The default is 10-fold CV. The function is implemented within the nestedcv package. Because hsstan models do not require tuning of meta-parameters, only a single CV procedure is needed to evaluate performance; this is implemented using the outer CV procedure in the nestedcv package. Binary outcomes (logistic regression) and continuous outcomes are supported. Multinomial models are currently not supported.
Usage
model.hsstan(y, x, unpenalized = NULL, ...)
Arguments
y: Response vector. For classification this should be a factor.
x: Matrix of predictors.
unpenalized: Vector of column names of x for covariates that are always retained in the model, i.e. not penalised.
...: Optional arguments passed to hsstan.
Details
Caution should be used when setting the number of cores available for parallelisation. The default setting in hsstan is to use 4 cores to parallelise the Markov chains of the Bayesian inference procedure. This can be switched off either by adding the argument cores = 1 (passed on to rstan) or by setting options(mc.cores = 1).
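For example, a minimal sketch of switching off chain parallelisation before fitting, and restoring the option afterwards:

oldopt <- options(mc.cores = 1)  # run MCMC chains sequentially
# ... fit models with outercv() here ...
options(oldopt)                  # restore the previous setting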
The argument cv.cores in outercv() controls parallelisation over the outer CV folds. On unix/mac, setting cv.cores to >1 will induce nested parallelisation, which will generate an error unless parallelisation of the chains is disabled using cores = 1 or by setting options(mc.cores = 1).
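As an illustrative sketch (y and x stand for a response vector and predictor matrix, as in the Examples below), parallelising over the outer folds while keeping the chains single-core could look like:

res <- outercv(y = y, x = x, model = "model.hsstan",
               n_outer_folds = 10,
               cv.cores = 10,  # one forked worker per outer fold (unix/mac)
               cores = 1)      # passed on to rstan: chains run sequentially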
Nested parallelisation is feasible if cv.cores is >1 and multicore_fork = FALSE is set, as this uses cluster-based parallelisation instead. Beware that large numbers of processes will be spawned: if we perform 10-fold cross-validation with 4 chains and set cv.cores = 10, then 40 processes will be invoked simultaneously.
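An illustrative sketch of this cluster-based approach (the argument values are examples only; here 10 workers x 4 chains would spawn 40 processes):

res <- outercv(y = y, x = x, model = "model.hsstan",
               n_outer_folds = 10,
               cv.cores = 10,            # 10 cluster workers for the outer folds
               multicore_fork = FALSE,   # cluster-based rather than fork-based
               chains = 4)               # 4 MCMC chains within each worker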
Value
An object of class hsstan
Author(s)
Athina Spiliopoulou
See Also
outercv, hsstan
Examples
# Cross-validation is used to apply univariate filtering of predictors.
# Only one CV split is needed (outercv) as the Bayesian model does not
# require learning of meta-parameters.
# control number of cores used for parallelisation over chains
oldopt <- options(mc.cores = 2)
# load iris dataset and simulate a continuous outcome
data(iris)
dt <- iris[, 1:4]
colnames(dt) <- c("marker1", "marker2", "marker3", "marker4")
dt <- as.data.frame(apply(dt, 2, scale))
dt$outcome.cont <- -3 + 0.5 * dt$marker1 + 2 * dt$marker2 + rnorm(nrow(dt), 0, 2)
library(hsstan)
# unpenalised covariates: always retain in the prediction model
uvars <- "marker1"
# penalised covariates: coefficients are drawn from hierarchical shrinkage
# prior
pvars <- c("marker2", "marker3", "marker4") # penalised covariates
# run cross-validation with univariate filter and hsstan
# dummy sampling for fast execution of example
# recommend 4 chains, warmup 1000, iter 2000 in practice
res.cv.hsstan <- outercv(y = dt$outcome.cont, x = dt[, c(uvars, pvars)],
                         model = "model.hsstan",
                         filterFUN = lm_filter,
                         filter_options = list(force_vars = uvars,
                                               nfilter = 2,
                                               p_cutoff = NULL,
                                               rsq_cutoff = 0.9),
                         n_outer_folds = 3,
                         chains = 2,
                         cv.cores = 1,
                         unpenalized = uvars, warmup = 100, iter = 200)
# view prediction performance based on testing folds
res.cv.hsstan$summary
# view coefficients for the final model
res.cv.hsstan$final_fit
# view covariates selected by the univariate filter
res.cv.hsstan$final_vars
# use hsstan package to examine the Bayesian model
sampler.stats(res.cv.hsstan$final_fit)
print(projsel(res.cv.hsstan$final_fit), digits = 4) # adding marker2
options(oldopt) # reset configuration
# Here adding `marker2` improves the model fit: substantial decrease of
# KL-divergence from the full model to the submodel. Adding `marker3` does
# not improve the model fit: no decrease of KL-divergence from the full model
# to the submodel.