R: Perform K-fold cross-validation

hbam_cv {hbamr}

R Documentation

Perform K-fold cross-validation

Description

This function performs K-fold cross-validation for an HBAM or FBAM model in order to estimate the expected log pointwise predictive density for a new dataset (ELPD). Multiple chains for one or more folds can be run in parallel using the future package.

Usage

hbam_cv(
  self = NULL,
  stimuli = NULL,
  model = "HBAM",
  allow_miss = 0,
  req_valid = NA,
  req_unique = 2,
  prefs = NULL,
  group_id = NULL,
  prep_data = TRUE,
  data = NULL,
  K = 10,
  chains = 2,
  warmup = 1000,
  iter = 3000,
  seed = 1,
  sigma_alpha = NULL,
  sigma_beta = 0.35,
  sigma_mu_alpha = NULL,
  sigma_mu_beta = 0.3,
  ...
)

Arguments

`self`	A numerical vector of N ideological self-placements. Any missing data must be coded as NA. This argument will not be used if the data have been prepared in advance via the `prep_data()` function.
`stimuli`	An N × J matrix of numerical stimulus placements, where J is the number of stimuli. Any missing data must be coded as NA. This argument will not be used if the data have been prepared in advance via the `prep_data()` function.
`model`	Character: Name of the model to be used. Defaults to HBAM.
`allow_miss`	Integer specifying how many missing stimulus positions to be accepted for an individual still to be included in the analysis. This argument will not be used if the data have been prepared in advance via the `prep_data()` function. Defaults to 0.
`req_valid`	Integer specifying how many valid observations to require for a respondent to be included in the analysis. The default is `req_valid = J - allow_miss`, but if specified, `req_valid` takes precedence. This argument will not be used if the data have been prepared in advance via the `prep_data()` function.
`req_unique`	Integer specifying how may unique positions on the ideological scale each respondent is required to have used when placing the stimuli in order to be included in the analysis. The default is `req_unique = 2`. This argument will not be used if the data have been prepared in advance via the `prep_data()` function.
`prefs`	An N × J matrix of numerical stimulus ratings or preference scores. These data are only required by the HBAM_R and HBAM_R_MINI models and will be ignored when fitting other models.
`group_id`	Integer vector of length N identifying which group each respondent belongs to. The supplied vector should range from 1 to the total number of groups in the data, and all integers between these numbers should be represented in the supplied data. These data are only required by models with "MULTI" in their name and will be ignored when fitting other models.
`prep_data`	Logical: Should the data be prepared before fitting the model? (Or have the data been prepared in advance by first running the `prep_data()` and `prep_data_cv()` functions)? If so, set `prep_data = FALSE`.) Defaults to `prep_data = TRUE`.
`data`	A list of data produced by `prep_data()` followed by `prep_data_cv()`.
`K`	An integer above 2, specifying the number of folds to use in the analysis. Defaults to 10.
`chains`	A positive integer specifying the number of Markov chains to use per fold. Defaults to 2.
`warmup`	A positive integer specifying the number of warmup (aka burn-in) iterations per chain. It defaults to 1000. The number of warmup iterations should be smaller than `iter`.
`iter`	A positive integer specifying the number of iterations for each chain (including warmup). It defaults to 3000 as running fewer chains for longer is a more efficient way to obtain a certain number of draws (and cross-validation can be computationally expensive).
`seed`	An integer passed on to `set.seed` before creating the folds to increase reproducibility and comparability. Defaults to 1 and only applies to fold-creation when the argument `prep_data` is `TRUE`. The supplied `seed` argument is also used to generate seeds for the sampling algorithm.
`sigma_alpha`	A positive numeric value specifying the standard deviation of the prior on the shift parameters in the FBAM model, or the standard deviation of the parameters' deviation from the group-means in FBAM_MULTI models. (This argument will be ignored by HBAM models.) Defaults to B / 4, where B measures the length of the survey scale as the number of possible placements on one side of the center.
`sigma_beta`	A positive numeric value specifying the standard deviation of the prior on the logged stretch parameters in the FBAM model, or the standard deviation of the logged parameters' deviation from the group-means in FBAM_MULTI models. (This argument will be ignored by HBAM models.) Defaults to .35.
`sigma_mu_alpha`	A positive numeric value specifying the standard deviation of the prior on the group-means of the shift parameters in MULTI-type models. Defaults to B / 5.
`sigma_mu_beta`	A positive numeric value specifying the standard deviation of the prior on the group-means of the logged stretch parameters in MULTI-type models. Defaults to .3.
`...`	Arguments passed to `rstan::sampling()`.

Value

A list of classes kfold and loo, which contains the following named elements:

"estimates": A ⁠1x2⁠ matrix containing the ELPD estimate and its standard error. The columns have names "Estimate" and "SE".
"pointwise": A Nx1 matrix with column name "elpd_kfold" containing the pointwise contributions for each data point.

Examples


# Loading and re-coding ANES 1980 data:
data(LC1980)
LC1980[LC1980 == 0 | LC1980 == 8 | LC1980 == 9] <- NA

# Making a small subset of the data for illustration:
self <- LC1980[1:50, 1]
stimuli <- LC1980[1:50, -1]

# Preparing to run chains in parallel using 2 cores via the future package:
  # Note: You would normally want to use all physical cores for this.
future::plan(future::multisession, workers = 2)

# Performing 10-fold cross-validation for the HBAM_MINI model:
  # Note: You would typically want to run the chains for more iterations.
cv_hbam_mini <- hbam_cv(self, stimuli, model = "HBAM_MINI",
                        chains = 1, warmup = 500, iter = 1000)

# Performing 10-fold cross-validation for the FBAM model:
cv_FBAM <- hbam_cv(self, stimuli, model = "FBAM",
                        chains = 1, warmup = 500, iter = 1000)

# Comparing the results using the loo package:
loo::loo_compare(list(HBAM_MINI = cv_hbam_mini,
                 FBAM = cv_FBAM))

# Stop the cluster of parallel sessions:
future::plan(future::sequential)

[Package hbamr version 2.3.1 Index]