R: Bootstrapping or multi-sample splits for variable selection

resample {modnets}

R Documentation

Bootstrapping or multi-sample splits for variable selection

Description

Multiple resampling procedures for selecting variables for a final network model. There are three resampling methods that can be parameterized in a variety of different ways. The ultimate goal is to fit models across iterated resamples with variable selection procedures built in so as to home in on the best predictors to include within a given model. The methods available include: bootstrapped resampling, multi-sample splitting, and stability selection.

Usage

resample(
  data,
  m = NULL,
  niter = 10,
  sampMethod = "bootstrap",
  criterion = "AIC",
  method = "glmnet",
  rule = "OR",
  gamma = 0.5,
  nfolds = 10,
  nlam = 50,
  which.lam = "min",
  threshold = FALSE,
  bonf = FALSE,
  alpha = 0.05,
  exogenous = TRUE,
  split = 0.5,
  center = TRUE,
  scale = FALSE,
  varSeed = NULL,
  seed = NULL,
  verbose = TRUE,
  lags = NULL,
  binary = NULL,
  type = "g",
  saveMods = TRUE,
  saveData = FALSE,
  saveVars = FALSE,
  fitit = TRUE,
  nCores = 1,
  cluster = "mclapply",
  block = FALSE,
  beepno = NULL,
  dayno = NULL,
  ...
)

Arguments

`data`	`n x k` dataframe. Cannot supply a matrix as input.
`m`	Character vector or numeric vector indicating the moderator(s), if any. Can also specify `"all"` to make every variable serve as a moderator, or `0` to indicate that there are no moderators. If the length of `m` is `k - 1` or longer, then it will not be possible to have the moderators as exogenous variables. Thus, `exogenous` will automatically become `FALSE`.
`niter`	Number of iterations for the resampling procedure.
`sampMethod`	Character string indicating which type of procedure to use. `"bootstrap"` is a standard bootstrapping procedure. `"split"` is the multi-sample split procedure where the data are split into disjoint training and test sets, the variables to be modeled are selected based on the training set, and then the final model is fit to the test set. `"stability"` is stability selection, where models are fit to each of two disjoint subsamples of the data, and it is calculated how frequently each variable is selected in each subset, as well as how frequently they are simultaneously selected in both subsets at each iteration.
`criterion`	The criterion for the variable selection procedure. Options include: `"cv", "aic", "bic", "ebic", "cp", "rss", "adjr2", "rsq", "r2"`. `"CV"` refers to cross-validation, the information criteria are `"AIC", "BIC", "EBIC"`, and `"Cp"`, which refers to Mallows's Cp. `"RSS"` is the residual sum of squares, `"adjR2"` is adjusted R-squared, and `"Rsq"` or `"R2"` is R-squared. Capitalization is ignored. For methods based on the LASSO, only `"CV", "AIC", "BIC", "EBIC"` are available. For methods based on subset selection, only `"Cp", "BIC", "RSS", "adjR2", "R2"` are available.
`method`	Character string to indicate which method to use for variable selection. Options include `"lasso"` and `"glmnet"`, both of which use the LASSO via the `glmnet` package (either with `glmnet::glmnet` or `glmnet::cv.glmnet`, depending upon the criterion). `"subset", "backward", "forward", "seqrep"`, all call different types of subset selection using the `leaps::regsubsets` function. Finally `"glinternet"` is used for applying the hierarchical lasso, and is the only method available for moderated network estimation (either with `glinternet::glinternet` or `glinternet::glinternet.cv`, depending upon the criterion). If one or more moderators are specified, then `method` will automatically default to `"glinternet"`.
`rule`	Only applies to GGMs (including between-subjects networks) when a threshold is supplied. The `"AND"` rule will only preserve edges when both corresponding coefficients have p-values below the threshold, while the `"OR"` rule will preserve an edge so long as at least one of the two coefficients has a p-value below the supplied threshold.
`gamma`	Numeric value of the hyperparameter for the `"EBIC"` criterion. Only relevant if `criterion = "EBIC"`. Recommended to use a value between 0 and .5, where larger values impose a larger penalty on the criterion.
`nfolds`	Only relevant if `criterion = "CV"`. Determines the number of folds to use in cross-validation.
`nlam`	If `method = "glinternet"`, determines the number of lambda values to evaluate in the selection path.
`which.lam`	Character string. Only applies if `criterion = "CV"`. Options include `"min"`, which uses the lambda value that minimizes the objective function, or `"1se"` which uses the lambda value at 1 standard error above the value that minimizes the objective function.
`threshold`	Logical or numeric. Indicates whether a threshold should be placed on the models at each iteration of the sampling. If `TRUE`, then a default value of .05 will be set. This is a consequential analytic choice that is left to the researcher.
`bonf`	Logical. Determines whether to apply a Bonferroni adjustment on the distribution of p-values for each coefficient.
`alpha`	Type 1 error rate. Defaults to .05.
`exogenous`	Logical. Indicates whether moderator variables should be treated as exogenous or not. If they are exogenous, they will not be modeled as outcomes/nodes in the network. If the number of moderators reaches `k - 1` or `k`, then `exogenous` will automatically be `FALSE`.
`split`	If `sampMethod = "split"` or `sampMethod = "stability"` then this is a value between 0 and 1 that indicates the proportion of the sample to be used for the training set. When `sampMethod = "stability"` there isn't an important distinction between the labels "training" and "test", although this value will still cause the two samples to be taken of complementary size.
`center`	Logical. Determines whether to mean-center the variables.
`scale`	Logical. Determines whether to standardize the variables.
`varSeed`	Numeric value providing a seed to be set at the beginning of the selection procedure. Recommended for reproducible results. Importantly, this seed will be used for the variable selection models at each iteration of the resampler. Caution: this means that while each model is run with a different sample, it will always have the same seed.
`seed`	Can be a single value, to set a seed before drawing random seeds of length `niter` to be used across iterations. Alternatively, one can supply a vector of seeds of length `niter`. It is recommended to use this argument for reproducibility over the `varSeed` argument.
`verbose`	Logical. Determines whether information about the modeling progress should be displayed in the console.
`lags`	Numeric or logical. Can only be 0, 1 or `TRUE` or `FALSE`. `NULL` is interpreted as `FALSE`. Indicates whether to fit a time-lagged network or a GGM.
`binary`	Numeric vector indicating which columns of the data contain binary variables.
`type`	Determines whether to use gaussian models `"g"` or binomial models `"c"`. Can also just use `"gaussian"` or `"binomial"`. Moreover, a vector of length `k` can be provided such that a value is given to every variable. Ultimately this is not necessary, though, as such values are automatically detected.
`saveMods`	Logical. Indicates whether to save the models fit to the samples at each iteration or not.
`saveData`	Logical. Determines whether to save the data from each subsample across iterations or not.
`saveVars`	Logical. Determines whether to save the variable selection models at each iteration.
`fitit`	Logical. Determines whether to fit the final selected model on the original sample. If `FALSE`, then this can still be done with `fitNetwork` and `modSelect`.
`nCores`	Numeric value indicating the number of CPU cores to use for the resampling. If `TRUE`, then the `parallel::detectCores` function will be used to detect the available cores so that as many as possible are used.
`cluster`	Character vector indicating which type of parallelization to use, if `nCores > 1`. Options include `"mclapply"` and `"SOCK"`.
`block`	Logical or numeric. Only applies to time-lagged models, so specifying it implies that `lags` is neither `0` nor `NULL`. If numeric, then block bootstrapping will be used, and the value specifies the block size. If `TRUE`, then an appropriate block size will be estimated automatically.
`beepno`	Character string or numeric value to indicate which variable (if any) encodes the survey number within a single day. Must be used in conjunction with `dayno` argument.
`dayno`	Character string or numeric value to indicate which variable (if any) encodes the day number of each survey. Must be used in conjunction with `beepno` argument.
`...`	Additional arguments.
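
To make the `rule` and `threshold` arguments more concrete, the following base R sketch applies the AND and OR rules to a small matrix of p-values. The matrix `pvals` and the helper `apply_rule` are hypothetical and exist only for illustration. They are not part of modnets, and the package's internal thresholding may differ in detail.

```r
# Hypothetical p-values from nodewise regressions: pvals[i, j] is the p-value
# of predictor j in the model for outcome i (the diagonal is unused).
pvals <- matrix(c(NA,   0.01, 0.20,
                  0.03, NA,   0.04,
                  0.60, 0.30, NA),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("V", 1:3), paste0("V", 1:3)))

# Illustrative helper (not a modnets function): returns a symmetric logical
# adjacency matrix under the AND or OR rule.
apply_rule <- function(p, threshold = 0.05, rule = c("OR", "AND")) {
  rule <- match.arg(rule)
  sig <- p < threshold        # significance of each directed coefficient
  sig[is.na(sig)] <- FALSE    # the diagonal never forms an edge
  if (rule == "AND") sig & t(sig) else sig | t(sig)
}

adj_and <- apply_rule(pvals, rule = "AND")
adj_or  <- apply_rule(pvals, rule = "OR")
```

In this toy input the V2-V3 edge is significant in only one direction, so it is kept under the OR rule and dropped under the AND rule.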

Details

Sampling methods can be specified via the sampMethod argument.

Bootstrapped resampling: Standard bootstrapped resampling, wherein a bootstrapped sample of size n is drawn with replacement at each iteration. Then, a variable selection procedure is applied to the sample, and the selected model is fit to obtain the parameter values. P-values and confidence intervals for the parameter distributions are then estimated.
Multi-sample splitting: Involves taking two disjoint samples from the original data – a training sample and a test sample. At each iteration the variable selection procedure is applied to the training sample, and then the resultant model is fit on the test sample. Parameters are then aggregated based on the coefficients in the models fit to the test samples.
Stability selection: Stability selection begins the same as multi-sample splitting, in that two disjoint samples are drawn from the data at each iteration. However, the variable selection procedure is then applied to each of the two subsamples at each iteration. The objective is to compute the proportion of times that each predictor was selected in each subsample across iterations, as well as the proportion of times that it was simultaneously selected in both disjoint samples. At the end of the resampling, the final model is selected by setting a frequency threshold between 0 and 1, indicating the minimum proportion of samples that a variable would have to have been selected to be retained in the final model.
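
The three sampling schemes described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's implementation: the sample size, the number of predictors, the placeholder selection indicators `sel1` and `sel2`, and the cutoff `freq_threshold` are all hypothetical, and applying the cutoff to the simultaneous selection frequency is only one possible choice.

```r
set.seed(1)      # arbitrary seed, for reproducibility of the sketch only
n <- 100         # hypothetical sample size
split <- 0.5     # proportion of rows assigned to the training set

# Bootstrap: n row indices drawn with replacement.
boot_idx <- sample.int(n, size = n, replace = TRUE)

# Multi-sample split and stability selection: two disjoint index sets.
train_idx <- sample.int(n, size = floor(split * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# Stability selection bookkeeping over niter iterations. sel1 and sel2 are
# placeholder indicators (iterations x predictors); in practice they would
# come from a variable selection model fit to each subsample.
niter <- 10
k <- 4
sel1 <- matrix(rbinom(niter * k, 1, 0.6) == 1, niter, k)
sel2 <- matrix(rbinom(niter * k, 1, 0.6) == 1, niter, k)

freq1    <- colMeans(sel1)         # selection frequency in subsample 1
freq2    <- colMeans(sel2)         # selection frequency in subsample 2
freqBoth <- colMeans(sel1 & sel2)  # simultaneous selection frequency

# Retain predictors whose simultaneous frequency meets the chosen cutoff.
freq_threshold <- 0.5              # hypothetical cutoff between 0 and 1
retained <- which(freqBoth >= freq_threshold)
```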

For the bootstrapping and multi-sample split methods, p-values are aggregated for each parameter using a method developed by Meinshausen, Meier, & Buhlmann (2009) that employs error control based on the false-discovery rate. The same procedure is employed for creating adjusted confidence intervals.
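
As a rough illustration of this aggregation step, the sketch below implements the quantile aggregation rule from Meinshausen, Meier, & Buhlmann (2009) for the p-values that a single coefficient receives across iterations. The helper `aggregate_pvals`, its argument names, the grid used for the adaptive search, and the example p-values are assumptions made for illustration. Whether modnets uses the fixed or the adaptive variant internally is not stated here.

```r
# Illustrative helper (not a modnets function): aggregate the p-values that
# one coefficient received across resampling iterations.
aggregate_pvals <- function(p, gamma = NULL, gamma_min = 0.05) {
  # Fixed gamma: min(1, empirical gamma-quantile of p / gamma).
  q_fixed <- function(g) {
    min(1, unname(quantile(p / g, probs = g, type = 1)))
  }
  if (!is.null(gamma)) return(q_fixed(gamma))
  # Adaptive variant: search gamma over a grid in [gamma_min, 1) and apply
  # the (1 - log(gamma_min)) correction that pays for the search.
  grid <- seq(gamma_min, 0.99, by = 0.01)
  min(1, (1 - log(gamma_min)) * min(vapply(grid, q_fixed, numeric(1))))
}

# Made-up p-values for one coefficient over 10 iterations.
p_iter <- c(0.001, 0.004, 0.002, 0.30, 0.003,
            0.001, 0.010, 0.002, 0.005, 0.001)

p_fixed    <- aggregate_pvals(p_iter, gamma = 0.5)  # based on the median
p_adaptive <- aggregate_pvals(p_iter)               # searches over gamma
```

With `gamma = 0.5` the rule amounts to doubling the empirical median p-value and capping the result at 1, so a single poor iteration (here 0.30) does not dominate the aggregate.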

A key distinguishing feature of the bootstrapping procedure implemented in this function versus the bootNet function is that the latter is designed to estimate the parameter distributions of a single model, whereas the version here is aimed at using the bootstrapped resamples to select a final model. In a practical sense, this boils down to using the bootstrapping method in the resample function to perform variable selection at each iteration of the resampling, rather than taking a single constrained model and applying it equally at all iterations.

Value

resample output

References

Meinshausen, N., Meier, L., & Buhlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association. 104, 1671-1681.

Meinshausen, N., & Buhlmann, P. (2010). Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 72, 417-473.

Examples


fit1 <- resample(ggmDat, m = 'M', niter = 10)

net(fit1)
netInts(fit1)

plot(fit1)
plot(fit1, what = 'coefs')
plot(fit1, what = 'bootstrap', multi = TRUE)
plot(fit1, what = 'pvals', outcome = 2, predictor = 4)

fit2 <- resample(gvarDat, m = 'M', niter = 10, lags = 1, sampMethod = 'stability')

plot(fit2, what = 'stability', outcome = 3)

[Package modnets version 0.9.0 Index]