cv.grpsel {grpsel}    R Documentation

Cross-validated group subset selection

Description

Fits the regularisation surface for a regression model with a group subset selection penalty and then cross-validates this surface.

Usage

cv.grpsel(
  x,
  y,
  group = seq_len(ncol(x)),
  penalty = c("grSubset", "grSubset+grLasso", "grSubset+Ridge"),
  loss = c("square", "logistic"),
  lambda = NULL,
  gamma = NULL,
  nfold = 10,
  folds = NULL,
  cv.loss = NULL,
  cluster = NULL,
  interpolate = TRUE,
  ...
)

Arguments

x

a predictor matrix

y

a response vector

group

a vector of length ncol(x) with the jth element identifying the group that the jth predictor belongs to; alternatively, a list of vectors with the kth vector identifying the predictors that belong to the kth group (useful for overlapping groups)

penalty

the type of penalty to apply; one of 'grSubset', 'grSubset+grLasso', or 'grSubset+Ridge'

loss

the type of loss function to use; 'square' for linear regression or 'logistic' for logistic regression

lambda

an optional list of decreasing sequences of group subset selection parameters; the list should contain a vector for each value of gamma

gamma

an optional decreasing sequence of group lasso or ridge parameters

nfold

the number of cross-validation folds

folds

an optional vector of length nrow(x) with the ith entry identifying the fold that the ith observation belongs to

cv.loss

an optional cross-validation loss function to use; should accept a vector of predicted values and a vector of actual values

cluster

an optional cluster for running cross-validation in parallel; must be set up using parallel::makeCluster; each fold is evaluated on a different node of the cluster

interpolate

a logical indicating whether to interpolate the lambda sequence for the cross-validation fits; see details below

...

any other arguments for grpsel()
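
As a rough illustration of how the group, folds, and cv.loss arguments might be supplied, the sketch below builds each one in base R. The names group.vec, group.list, my.folds, and mae are hypothetical and chosen only for this example, and the data are simulated. The call to cv.grpsel() runs only if the grpsel package is installed.

```r
set.seed(1)
n <- 60
p <- 6
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, x[, 1] + x[, 2])

# group as a vector: the jth element is the group of the jth predictor
group.vec <- rep(1:3, each = 2)

# group as a list: the kth vector holds the predictors in the kth group
# (this form permits overlap; here predictor 3 sits in groups 1 and 2)
group.list <- list(1:3, 3:4, 5:6)

# folds: one fold label per observation, balanced across 5 folds
nfold <- 5
my.folds <- sample(rep_len(seq_len(nfold), n))

# cv.loss: a function of predicted values and actual values
# (mean absolute error is used here purely as an example)
mae <- function(pred, actual) mean(abs(actual - pred))

if (requireNamespace("grpsel", quietly = TRUE)) {
  # group.list could be passed in place of group.vec for overlapping groups
  fit <- grpsel::cv.grpsel(x, y, group.vec, nfold = nfold,
                           folds = my.folds, cv.loss = mae)
}
```

Building folds by hand is mainly useful when fold membership must respect some structure in the data, such as keeping related observations together.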

Details

When loss='logistic', stratified cross-validation is used to balance the folds.

When fitting to the cross-validation folds, interpolate=TRUE cross-validates the midpoints between consecutive lambda values rather than the original lambda sequence. This new sequence retains the same set of solutions on the full data, but often leads to superior cross-validation performance.
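
The midpoint construction can be sketched in base R. This is only an illustration of the idea, not the package's internal code: the lambda values are made up, the name lambda.mid is hypothetical, and how the package treats the ends of the sequence is not covered here.

```r
# A made-up decreasing lambda sequence (illustrative values only)
lambda <- c(8, 4, 2, 1)

# Midpoints between consecutive values: one fewer element than lambda
lambda.mid <- (head(lambda, -1) + tail(lambda, -1)) / 2
lambda.mid  # 6, 3, 1.5
```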

Value

An object of class cv.grpsel; a list with the following components:

cv.mean

a list of vectors containing cross-validation means per value of lambda; an individual vector in the list for each value of gamma

cv.sd

a list of vectors containing cross-validation standard errors per value of lambda; an individual vector in the list for each value of gamma

lambda

a list of vectors containing the values of lambda used in the fit; an individual vector in the list for each value of gamma

gamma

a vector containing the values of gamma used in the fit

lambda.min

the value of lambda minimising cv.mean

gamma.min

the value of gamma minimising cv.mean

fit

the fit from running grpsel() on the full data
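
To make the list-of-vectors layout concrete, the sketch below uses a hand-built mock with made-up numbers (not output from cv.grpsel()) and locates the minimising pair by scanning cv.mean. It assumes that cv.mean and lambda share the same shape, with one vector per value of gamma; the object name mock is hypothetical.

```r
# Mock object mimicking the documented layout (values are invented)
mock <- list(
  cv.mean = list(c(1.9, 1.2, 1.4), c(1.7, 1.1, 1.3)),
  lambda  = list(c(3, 2, 1), c(2.5, 1.5, 0.5)),
  gamma   = c(1, 0.1)
)

# Smallest cross-validation mean within each gamma value
best.per.gamma <- vapply(mock$cv.mean, min, numeric(1))

# Index of the best gamma, then of the best lambda for that gamma
i <- which.min(best.per.gamma)
j <- which.min(mock$cv.mean[[i]])

gamma.min <- mock$gamma[i]
lambda.min <- mock$lambda[[i]][j]
```

On a real fit, lambda.min and gamma.min are already provided as components, so this kind of scan is only needed when a different selection rule is wanted.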

Author(s)

Ryan Thompson <ryan.thompson@monash.edu>

Examples

# Grouped data
set.seed(123)
n <- 100
p <- 10
g <- 5
group <- rep(1:g, each = p / g)
beta <- numeric(p)
beta[which(group %in% 1:2)] <- 1
x <- matrix(rnorm(n * p), n, p)
y <- rnorm(n, x %*% beta)
newx <- matrix(rnorm(p), ncol = p)

# Group subset selection
fit <- cv.grpsel(x, y, group)
plot(fit)
coef(fit)
predict(fit, newx)

# Parallel cross-validation
cl <- parallel::makeCluster(2)
fit <- cv.grpsel(x, y, group, cluster = cl)
parallel::stopCluster(cl)

[Package grpsel version 1.3.1 Index]