cv.grpsel {grpsel}

R Documentation

Cross-validated group subset selection

Description

Fits the regularisation surface for a regression model with a group subset selection penalty and then cross-validates this surface.

Usage

cv.grpsel(
  x,
  y,
  group = seq_len(ncol(x)),
  penalty = c("grSubset", "grSubset+grLasso", "grSubset+Ridge"),
  loss = c("square", "logistic"),
  lambda = NULL,
  gamma = NULL,
  nfold = 10,
  folds = NULL,
  cv.loss = NULL,
  cluster = NULL,
  interpolate = TRUE,
  ...
)

Arguments

`x`	a predictor matrix
`y`	a response vector
`group`	a vector of length `ncol(x)` with the jth element identifying the group that the jth predictor belongs to; alternatively, a list of vectors with the kth vector identifying the predictors that belong to the kth group (useful for overlapping groups)
`penalty`	the type of penalty to apply; one of 'grSubset', 'grSubset+grLasso', or 'grSubset+Ridge'
`loss`	the type of loss function to use; 'square' for linear regression or 'logistic' for logistic regression
`lambda`	an optional list of decreasing sequences of group subset selection parameters; the list should contain a vector for each value of `gamma`
`gamma`	an optional decreasing sequence of group lasso or ridge parameters
`nfold`	the number of cross-validation folds
`folds`	an optional vector of length `nrow(x)` with the ith entry identifying the fold that the ith observation belongs to
`cv.loss`	an optional cross-validation loss function to use; should accept a vector of predicted values and a vector of actual values
`cluster`	an optional cluster for running cross-validation in parallel; must be set up using `parallel::makeCluster`; each fold is evaluated on a different node of the cluster
`interpolate`	a logical indicating whether to interpolate the `lambda` sequence for the cross-validation fits; see details below
`...`	any other arguments for `grpsel()`
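
As a minimal sketch of how the `group`, `folds`, and `cv.loss` arguments described above might be constructed: the names `group_list`, `folds`, and `mae` are illustrative only, and the argument order of `mae` (predictions first, then actual values) is taken from the description of `cv.loss`. The call to `cv.grpsel()` is guarded so that it runs only when grpsel is installed.

```r
set.seed(1)
n <- 60
p <- 6
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, x[, 1] + x[, 2])

# Overlapping groups as a list: predictor 3 belongs to groups 1 and 2
group_list <- list(1:3, 3:4, 5:6)

# Custom fold assignment: one fold label per observation, balanced fold sizes
nfold <- 5
folds <- sample(rep(seq_len(nfold), length.out = n))

# Hypothetical custom cross-validation loss: mean absolute error
mae <- function(pred, actual) mean(abs(pred - actual))

# Only attempt the fit if grpsel is available
if (requireNamespace("grpsel", quietly = TRUE)) {
  fit <- grpsel::cv.grpsel(x, y, group = group_list, nfold = nfold,
                           folds = folds, cv.loss = mae)
}
```

Supplying `folds` explicitly can be useful when the same fold assignment should be reused across several fits.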

Details

When `loss = 'logistic'`, stratified cross-validation is used to balance the folds.

When `interpolate = TRUE`, the fits on the cross-validation folds are evaluated at the midpoints between consecutive `lambda` values rather than at the original `lambda` sequence. This new sequence retains the same set of solutions on the full data, but often leads to superior cross-validation performance.
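
The two ideas above can be sketched in base R. This is an illustration under stated assumptions, not the package's internal code: the `lambda` values are made up, the treatment of the ends of the sequence is not specified here, and the stratification shown is one simple way to balance classes across folds.

```r
# A decreasing lambda sequence, as might come from the full-data fit (made-up values)
lambda <- c(8, 4, 2, 1)

# Midpoints between consecutive lambda values
lambda_mid <- (lambda[-length(lambda)] + lambda[-1]) / 2

# Stratified fold assignment for a binary response:
# fold labels are drawn within each class so class proportions match across folds
set.seed(1)
y <- rep(c(0, 1), times = c(30, 10))
nfold <- 5
folds <- integer(length(y))
for (cls in unique(y)) {
  idx <- which(y == cls)
  folds[idx] <- sample(rep(seq_len(nfold), length.out = length(idx)))
}

# Cross-tabulate folds against classes to check the balance
fold_table <- table(folds, y)
```

Each midpoint lies strictly between two values of the original sequence, which is why the set of solutions on the full data is unchanged.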

Value

An object of class cv.grpsel; a list with the following components:

`cv.mean`	a list of vectors containing cross-validation means per value of `lambda`; an individual vector in the list for each value of `gamma`
`cv.sd`	a list of vectors containing cross-validation standard errors per value of `lambda`; an individual vector in the list for each value of `gamma`
`lambda`	a list of vectors containing the values of `lambda` used in the fit; an individual vector in the list for each value of `gamma`
`gamma`	a vector containing the values of `gamma` used in the fit
`lambda.min`	the value of `lambda` minimising `cv.mean`
`gamma.min`	the value of `gamma` minimising `cv.mean`
`fit`	the fit from running `grpsel()` on the full data
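
To show how `lambda.min` and `gamma.min` relate to `cv.mean`, the following sketch locates the minimising pair on a mock object shaped like the components above. All numbers and the variable names (`cv_mean`, `best_gamma`, `best_lambda`) are hypothetical; the package computes these values itself.

```r
# Mock components shaped like the documented return value (made-up numbers)
cv_mean <- list(c(1.9, 1.2, 1.4), c(1.8, 1.1, 1.3))  # one vector per gamma
lambda  <- list(c(4, 2, 1), c(3, 1.5, 0.75))         # one vector per gamma
gamma   <- c(0.5, 0.1)

# Index of the gamma whose best lambda achieves the smallest CV mean
best_gamma <- which.min(vapply(cv_mean, min, numeric(1)))

# Index of the best lambda within that gamma's sequence
best_lambda <- which.min(cv_mean[[best_gamma]])

gamma_min  <- gamma[best_gamma]
lambda_min <- lambda[[best_gamma]][best_lambda]
```

On a real `cv.grpsel` object the same values would be read directly from the `gamma.min` and `lambda.min` components.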

Author(s)

Ryan Thompson <ryan.thompson@monash.edu>

Examples

# Grouped data
set.seed(123)
n <- 100
p <- 10
g <- 5
group <- rep(1:g, each = p / g)
beta <- numeric(p)
beta[which(group %in% 1:2)] <- 1
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, x %*% beta)
newx <- matrix(rnorm(p), ncol = p)

# Group subset selection
fit <- cv.grpsel(x, y, group)
plot(fit)
coef(fit)
predict(fit, newx)

# Parallel cross-validation
cl <- parallel::makeCluster(2)
fit <- cv.grpsel(x, y, group, cluster = cl)
parallel::stopCluster(cl)

[Package grpsel version 1.3.1 Index]