R: Run search and performance evaluation with cross-validation

cv_varsel {projpred}

R Documentation

Run search and performance evaluation with cross-validation

Description

Run the search part and the evaluation part for a projection predictive variable selection. The search part determines the predictor ranking (also known as solution path), i.e., the best submodel for each submodel size (number of predictor terms). The evaluation part determines the predictive performance of the submodels along the predictor ranking. In contrast to varsel(), cv_varsel() performs a cross-validation (CV) by running the search part with the training data of each CV fold separately (an exception is explained in section "Note" below) and by running the evaluation part on the corresponding test set of each CV fold. A special method is cv_varsel.vsel() because it re-uses the search results from an earlier cv_varsel() (or varsel()) run, as illustrated in the main vignette.

Usage

cv_varsel(object, ...)

## Default S3 method:
cv_varsel(object, ...)

## S3 method for class 'vsel'
cv_varsel(
  object,
  cv_method = object$cv_method %||% "LOO",
  nloo = object$nloo,
  K = object$K %||% if (!inherits(object, "datafit")) 5 else 10,
  cvfits = object$cvfits,
  validate_search = object$validate_search %||% TRUE,
  ...
)

## S3 method for class 'refmodel'
cv_varsel(
  object,
  method = "forward",
  cv_method = if (!inherits(object, "datafit")) "LOO" else "kfold",
  ndraws = NULL,
  nclusters = 20,
  ndraws_pred = 400,
  nclusters_pred = NULL,
  refit_prj = !inherits(object, "datafit"),
  nterms_max = NULL,
  penalty = NULL,
  verbose = TRUE,
  nloo = object$nobs,
  K = if (!inherits(object, "datafit")) 5 else 10,
  cvfits = object$cvfits,
  search_control = NULL,
  lambda_min_ratio = 1e-05,
  nlambda = 150,
  thresh = 1e-06,
  validate_search = TRUE,
  seed = NA,
  search_terms = NULL,
  search_out = NULL,
  parallel = getOption("projpred.prll_cv", FALSE),
  ...
)

Arguments

`object`	An object of class `refmodel` (returned by `get_refmodel()` or `init_refmodel()`) or an object that can be passed to argument `object` of `get_refmodel()`.
`...`	For `cv_varsel.default()`: Arguments passed to `get_refmodel()` as well as to `cv_varsel.refmodel()`. For `cv_varsel.vsel()`: Arguments passed to `cv_varsel.refmodel()`. For `cv_varsel.refmodel()`: Arguments passed to the divergence minimizer (see argument `div_minimizer` of `init_refmodel()` as well as section "Draw-wise divergence minimizers" of projpred-package) when refitting the submodels for the performance evaluation (if `refit_prj` is `TRUE`).
`cv_method`	The CV method, either `"LOO"` or `"kfold"`. In the `"LOO"` case, a Pareto-smoothed importance sampling leave-one-out CV (PSIS-LOO CV) is performed, which avoids refitting the reference model `nloo` times (in contrast to a standard LOO CV). In the `"kfold"` case, a `K`-fold CV is performed. See also section "Note" below.
`nloo`	Caution: Still experimental. Only relevant if `cv_method = "LOO"`. If `nloo` is smaller than the number of all observations, approximate full LOO CV using probability-proportional-to-size-sampling (PPS) to make accurate computation only for `nloo` (anything from 1 to the number of all observations) leave-one-out folds (Magnusson et al., 2019). Smaller values lead to faster computation but higher uncertainty in the evaluation part. If `NULL`, all observations are used (as by default).
`K`	Only relevant if `cv_method = "kfold"` and if `cvfits` is `NULL` (which is the case for reference model objects created by `get_refmodel.stanreg()` or `brms::get_refmodel.brmsfit()`). Number of folds in `K`-fold CV.
`cvfits`	Only relevant if `cv_method = "kfold"`. The same as argument `cvfits` of `init_refmodel()`, but repeated here so that output from `run_cvfun()` can be inserted here straightforwardly.
`validate_search`	A single logical value indicating whether to cross-validate also the search part, i.e., whether to run the search separately for each CV fold (`TRUE`) or not (`FALSE`). We strongly do not recommend setting this to `FALSE`, because this is known to bias the predictive performance estimates of the selected submodels. However, setting this to `FALSE` can sometimes be useful because comparing the results to the case where this argument is `TRUE` gives an idea of how strongly the search is (over-)fitted to the data (the difference corresponds to the search degrees of freedom or the effective number of parameters introduced by the search).
`method`	The method for the search part. Possible options are `"forward"` for forward search and `"L1"` for L1 search. See also section "Details" below.
`ndraws`	Number of posterior draws used in the search part. Ignored if `nclusters` is not `NULL` or in case of L1 search (because L1 search always uses a single cluster). If both (`nclusters` and `ndraws`) are `NULL`, the number of posterior draws from the reference model is used for `ndraws`. See also section "Details" below.
`nclusters`	Number of clusters of posterior draws used in the search part. Ignored in case of L1 search (because L1 search always uses a single cluster). For the meaning of `NULL`, see argument `ndraws`. See also section "Details" below.
`ndraws_pred`	Only relevant if `refit_prj` is `TRUE`. Number of posterior draws used in the evaluation part. Ignored if `nclusters_pred` is not `NULL`. If both (`nclusters_pred` and `ndraws_pred`) are `NULL`, the number of posterior draws from the reference model is used for `ndraws_pred`. See also section "Details" below.
`nclusters_pred`	Only relevant if `refit_prj` is `TRUE`. Number of clusters of posterior draws used in the evaluation part. For the meaning of `NULL`, see argument `ndraws_pred`. See also section "Details" below.
`refit_prj`	For the evaluation part, should the submodels along the predictor ranking be fitted again (`TRUE`) or should their fits from the search part be re-used (`FALSE`)?
`nterms_max`	Maximum submodel size (number of predictor terms) up to which the search is continued. If `NULL`, then `min(19, D)` is used where `D` is the number of terms in the reference model (or in `search_terms`, if supplied). Note that `nterms_max` does not count the intercept, so use `nterms_max = 0` for the intercept-only model. (Correspondingly, `D` above does not count the intercept.)
`penalty`	Only relevant for L1 search. A numeric vector determining the relative penalties or costs for the predictors. A value of `0` means that those predictors have no cost and will therefore be selected first, whereas `Inf` means those predictors will never be selected. If `NULL`, then `1` is used for each predictor.
`verbose`	A single logical value indicating whether to print out additional information during the computations.
`search_control`	A `list` of "control" arguments (i.e., tuning parameters) for the search. In case of forward search, these arguments are passed to the divergence minimizer (see argument `div_minimizer` of `init_refmodel()` as well as section "Draw-wise divergence minimizers" of projpred-package). In case of forward search, `NULL` causes `...` to be used not only for the performance evaluation, but also for the search. In case of L1 search, possible arguments are: `lambda_min_ratio`: Ratio between the smallest and largest lambda in the L1-penalized search (default: `1e-5`). This parameter essentially determines how long the search is carried out, i.e., how large submodels are explored. No need to change this unless the program gives a warning about this. `nlambda`: Number of values in the lambda grid for L1-penalized search (default: `150`). No need to change this unless the program gives a warning about this. `thresh`: Convergence threshold when computing the L1 path (default: `1e-6`). Usually, there is no need to change this.
`lambda_min_ratio`	Deprecated (please use `search_control` instead). Only relevant for L1 search. Ratio between the smallest and largest lambda in the L1-penalized search. This parameter essentially determines how long the search is carried out, i.e., how large submodels are explored. No need to change this unless the program gives a warning about this.
`nlambda`	Deprecated (please use `search_control` instead). Only relevant for L1 search. Number of values in the lambda grid for L1-penalized search. No need to change this unless the program gives a warning about this.
`thresh`	Deprecated (please use `search_control` instead). Only relevant for L1 search. Convergence threshold when computing the L1 path. Usually, there is no need to change this.
`seed`	Pseudorandom number generation (PRNG) seed by which the same results can be obtained again if needed. Passed to argument `seed` of `set.seed()`, but can also be `NA` to not call `set.seed()` at all. If not `NA`, then the PRNG state is reset (to the state before calling `cv_varsel()`) upon exiting `cv_varsel()`. Here, `seed` is used for clustering the reference model's posterior draws (if `!is.null(nclusters)` or `!is.null(nclusters_pred)`), for subsampling PSIS-LOO CV folds (if `nloo` is smaller than the number of observations), for sampling the folds in `K`-fold CV, and for drawing new group-level effects when predicting from a multilevel submodel (however, not yet in case of a GAMM).
`search_terms`	Only relevant for forward search. A custom character vector of predictor term blocks to consider for the search. Section "Details" below describes more precisely what "predictor term block" means. The intercept (`"1"`) is always included internally via `union()`, so there's no difference between including it explicitly or omitting it. The default `search_terms` considers all the terms in the reference model's formula.
`search_out`	Intended for internal use.
`parallel`	A single logical value indicating whether to run costly parts of the CV in parallel (`TRUE`) or not (`FALSE`). See also section "Note" below.

Details

Arguments ndraws, nclusters, nclusters_pred, and ndraws_pred are automatically truncated at the number of posterior draws in the reference model (which is 1 for datafits). Using less draws or clusters in ndraws, nclusters, nclusters_pred, or ndraws_pred than posterior draws in the reference model may result in slightly inaccurate projection performance. Increasing these arguments affects the computation time linearly.

For argument method, there are some restrictions: For a reference model with multilevel or additive formula terms or a reference model set up for the augmented-data projection, only the forward search is available. Furthermore, argument search_terms requires a forward search to take effect.

L1 search is faster than forward search, but forward search may be more accurate. Furthermore, forward search may find a sparser model with comparable performance to that found by L1 search, but it may also start overfitting when more predictors are added.

An L1 search may select an interaction term before all involved lower-order interaction terms (including main-effect terms) have been selected. In projpred versions > 2.6.0, the resulting predictor ranking is automatically modified so that the lower-order interaction terms come before this interaction term, but if this is conceptually undesired, choose the forward search instead.

The elements of the search_terms character vector don't need to be individual predictor terms. Instead, they can be building blocks consisting of several predictor terms connected by the + symbol. To understand how these building blocks work, it is important to know how projpred's forward search works: It starts with an empty vector chosen which will later contain already selected predictor terms. Then, the search iterates over model sizes j \in \{0, ..., J\} (with J denoting the maximum submodel size, not counting the intercept). The candidate models at model size j are constructed from those elements from search_terms which yield model size j when combined with the chosen predictor terms. Note that sometimes, there may be no candidate models for model size j. Also note that internally, search_terms is expanded to include the intercept ("1"), so the first step of the search (model size 0) always consists of the intercept-only model as the only candidate.

As a search_terms example, consider a reference model with formula y ~ x1 + x2 + x3. Then, to ensure that x1 is always included in the candidate models, specify search_terms = c("x1", "x1 + x2", "x1 + x3", "x1 + x2 + x3") (or, in a simpler way that leads to the same results, search_terms = c("x1", "x1 + x2", "x1 + x3"), for which helper function force_search_terms() exists). This search would start with y ~ 1 as the only candidate at model size 0. At model size 1, y ~ x1 would be the only candidate. At model size 2, y ~ x1 + x2 and y ~ x1 + x3 would be the two candidates. At the last model size of 3, y ~ x1 + x2 + x3 would be the only candidate. As another example, to exclude x1 from the search, specify search_terms = c("x2", "x3", "x2 + x3") (or, in a simpler way that leads to the same results, search_terms = c("x2", "x3")).

Value

An object of class vsel. The elements of this object are not meant to be accessed directly but instead via helper functions (see the main vignette and projpred-package).

Note

If validate_search is FALSE, the search is not included in the CV so that only a single full-data search is run.

For PSIS-LOO CV, projpred calls loo::psis() (or, exceptionally, loo::sis(), see below) with r_eff = NA. This is only a problem if there was extreme autocorrelation between the MCMC iterations when the reference model was built. In those cases however, the reference model should not have been used anyway, so we don't expect projpred's r_eff = NA to be a problem.

PSIS cannot be used if the draws have different (i.e., nonconstant) weights or if the number of draws is too small. In such cases, projpred resorts to standard importance sampling (SIS) and throws a warning about this. Throughout the documentation, the term "PSIS" is used even though in fact, projpred resorts to SIS in these special cases.

With parallel = TRUE, costly parts of projpred's CV are run in parallel. Costly parts are the fold-wise searches and performance evaluations in case of validate_search = TRUE. (Note that in case of K-fold CV, the K reference model refits are not affected by argument parallel; only projpred's CV is affected.) The parallelization is powered by the foreach package. Thus, any parallel (or sequential) backend compatible with foreach can be used, e.g., the backends from packages doParallel, doMPI, or doFuture. For GLMs, this CV parallelization should work reliably, but for other models (such as GLMMs), it may lead to excessive memory usage which in turn may crash the R session (on Unix systems, setting an appropriate memory limit via unix::rlimit_as() may avoid crashing the whole machine). However, the problem of excessive memory usage is less pronounced for the CV parallelization than for the projection parallelization described in projpred-package. In that regard, the CV parallelization is recommended over the projection parallelization.

References

Magnusson, Måns, Michael Andersen, Johan Jonasson, and Aki Vehtari. 2019. "Bayesian Leave-One-Out Cross-Validation for Large Data." In Proceedings of the 36th International Conference on Machine Learning, edited by Kamalika Chaudhuri and Ruslan Salakhutdinov, 97:4244–53. Proceedings of Machine Learning Research. PMLR. https://proceedings.mlr.press/v97/magnusson19a.html.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. "Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC." Statistics and Computing 27 (5): 1413–32. doi:10.1007/s11222-016-9696-4.

Vehtari, Aki, Daniel Simpson, Andrew Gelman, Yuling Yao, and Jonah Gabry. 2022. "Pareto Smoothed Importance Sampling." arXiv. doi:10.48550/arXiv.1507.02646.

Examples


# Data:
dat_gauss <- data.frame(y = df_gaussian$y, df_gaussian$x)

# The `stanreg` fit which will be used as the reference model (with small
# values for `chains` and `iter`, but only for technical reasons in this
# example; this is not recommended in general):
fit <- rstanarm::stan_glm(
  y ~ X1 + X2 + X3 + X4 + X5, family = gaussian(), data = dat_gauss,
  QR = TRUE, chains = 2, iter = 1000, refresh = 0, seed = 9876
)

# Run cv_varsel() (with L1 search and small values for `K`, `nterms_max`, and
# `nclusters_pred`, but only for the sake of speed in this example; this is
# not recommended in general):
cvvs <- cv_varsel(fit, method = "L1", cv_method = "kfold", K = 2,
                  nterms_max = 3, nclusters_pred = 10, seed = 5555)
# Now see, for example, `?print.vsel`, `?plot.vsel`, `?suggest_size.vsel`,
# and `?ranking` for possible post-processing functions.

[Package projpred version 2.8.0 Index]