cv.saenet {miselect} R Documentation

Cross Validated Multiple Imputation Stacked Adaptive Elastic Net

Description

Performs k-fold cross-validation for saenet and returns the optimal values of lambda and alpha.

Usage

cv.saenet(
  x,
  y,
  pf,
  adWeight,
  weights,
  family = c("gaussian", "binomial"),
  alpha = 1,
  nlambda = 100,
  lambda.min.ratio = ifelse(isTRUE(all.equal(adWeight, rep(1, p))), 0.001, 1e-06),
  lambda = NULL,
  nfolds = 5,
  foldid = NULL,
  maxit = 1000,
  eps = 1e-05
)

Arguments

x

A length m list of n * p numeric matrices. No matrix should contain an intercept column or any missing values.

y

A length m list of length n numeric response vectors. No vector should contain missing values

pf

Penalty factor of length p. Can be used to differentially penalize certain variables. A value of 0 leaves the corresponding covariate unpenalized

adWeight

Numeric vector of length p representing the adaptive weights for the L1 penalty

weights

Numeric vector of length n containing the proportion observed (non-missing) for each row in the un-imputed data.

family

The type of response. "gaussian" implies a continuous response and "binomial" implies a binary response. Default is "gaussian".

alpha

Elastic net parameter. Can be a vector to cross validate over. Default is 1

nlambda

Length of automatically generated "lambda" sequence. If "lambda" is non NULL, "nlambda" is ignored. Default is 100

lambda.min.ratio

Ratio that determines the minimum value of "lambda" when automatically generating a "lambda" sequence. If "lambda" is not NULL, "lambda.min.ratio" is ignored. Default is 1e-3 when every element of "adWeight" equals 1, and 1e-6 otherwise

lambda

Optional numeric vector of lambdas to fit. If NULL, cv.saenet will automatically generate a lambda sequence based on nlambda and lambda.min.ratio. Default is NULL

nfolds

Number of folds to use for cross validation. Default is 5, minimum is 3

foldid

An optional length n vector of values between 1 and nfolds identifying the fold each observation belongs to. If NULL, cv.saenet will automatically generate folds. Default is NULL

maxit

Maximum number of iterations to run. Default is 1000

eps

Tolerance for convergence. Default is 1e-5
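
As an illustration of the "foldid" argument, the sketch below builds a fold assignment by hand in base R. It shows one common scheme (randomly shuffled, roughly equal-sized folds); whether cv.saenet generates its default folds this way is an assumption, and the value of n here is a hypothetical placeholder for the number of rows in each imputed matrix.

```r
set.seed(48109)

n      <- 23  # hypothetical number of observations (rows of each imputed matrix)
nfolds <- 5   # must be at least 3 according to the argument description

# Recycle the fold labels 1..nfolds to length n, then shuffle them so that
# every observation gets a random fold and fold sizes differ by at most one
foldid <- sample(rep(seq_len(nfolds), length.out = n))

# Inspect the resulting fold sizes
table(foldid)
```

A vector built this way could then be supplied as foldid = foldid, which keeps the fold assignment fixed across repeated calls.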

Details

cv.saenet works by stacking the multiply imputed data into a single matrix and running a weighted adaptive elastic net on it. Simulations suggest that "stacked" objective function approaches tend to be more computationally efficient, and to have better estimation and selection properties, than the grouped alternatives (see References).

Due to stacking, the lambda sequence that cv.saenet generates automatically may underestimate lambda.max, so the degrees of freedom may be nonzero at the first lambda value.
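
The stacking step described above can be pictured with a small base R sketch. It row-binds m toy imputed matrices into one stacked matrix and repeats the observation weights once per imputation. The toy data, the variable names, and the choice to divide the repeated weights by m are assumptions made for illustration only; the package's internal implementation may differ.

```r
set.seed(1)
m <- 3  # number of imputations
n <- 4  # observations per imputed dataset
p <- 2  # covariates

# Toy stand-ins for the m imputed design matrices and responses (hypothetical data)
x <- lapply(seq_len(m), function(i) matrix(rnorm(n * p), nrow = n, ncol = p))
y <- lapply(seq_len(m), function(i) rnorm(n))

# Proportion observed per row in the un-imputed data (hypothetical values)
weights <- c(1, 0.5, 1, 0.75)

# Stack the imputed datasets on top of each other
x_stacked <- do.call(rbind, x)  # (m * n) x p matrix
y_stacked <- unlist(y)          # length m * n vector

# Repeat the row weights once per imputation. Dividing by m is an assumption
# of this sketch, so that each original row keeps its total weight.
w_stacked <- rep(weights, times = m) / m

dim(x_stacked)
length(y_stacked)
sum(w_stacked)
```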

Value

An object of type "cv.saenet" with 10 elements:

call

The call that generated the output.

lambda

Sequence of lambdas fit.

cvm

Average cross validation error for each lambda and alpha. For family = "gaussian", "cvm" corresponds to mean squared error, and for binomial "cvm" corresponds to deviance.

cvse

Standard error of "cvm".

saenet.fit

A "saenet" object fit to the full data.

lambda.min

The lambda value for the model with the minimum cross validation error.

lambda.1se

The lambda value for the sparsest model within one standard error of the minimum cross validation error.

alpha.min

The alpha value for the model with the minimum cross validation error.

alpha.1se

The alpha value for the sparsest model within one standard error of the minimum cross validation error.

df

The number of nonzero coefficients for each value of lambda and alpha.
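
To make the relationship between "cvm", "cvse", "df" and the selected values concrete, here is a minimal base R sketch of the minimum-error and one-standard-error rules over a small lambda-by-alpha grid. The numbers are made up, the layout (rows index lambda, columns index alpha) is an assumption, and the tie-breaking shown here is not necessarily what cv.saenet does internally.

```r
lambda <- c(1.0, 0.5, 0.25, 0.125)
alpha  <- c(0.5, 1)

# Hypothetical cross validation results; matrix() fills column by column,
# so each column holds the values for one alpha
cvm  <- matrix(c(1.30, 1.10, 1.00, 1.02,
                 1.25, 1.02, 0.98, 1.01), nrow = 4)
cvse <- matrix(0.05, nrow = 4, ncol = 2)
df   <- matrix(c(0, 3, 6, 9,
                 0, 2, 4, 7), nrow = 4)

# Minimum rule: the grid cell with the smallest cross validation error
best       <- which(cvm == min(cvm), arr.ind = TRUE)[1, ]
lambda_min <- lambda[best[["row"]]]
alpha_min  <- alpha[best[["col"]]]

# One standard error rule: among cells whose error is within one standard
# error of the minimum, take the one with the fewest nonzero coefficients
threshold  <- cvm[best[["row"]], best[["col"]]] + cvse[best[["row"]], best[["col"]]]
cand       <- which(cvm <= threshold, arr.ind = TRUE)
pick       <- cand[which.min(df[cand]), ]
lambda_1se <- lambda[pick[["row"]]]
alpha_1se  <- alpha[pick[["col"]]]

c(lambda_min = lambda_min, alpha_min = alpha_min,
  lambda_1se = lambda_1se, alpha_1se = alpha_1se)
```

In this toy grid the one-standard-error rule trades a slightly higher error for a model with 2 rather than 4 nonzero coefficients.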

References

Du, J., Boss, J., Han, P., Beesley, L. J., Kleinsasser, M., Goutman, S. A., ... & Mukherjee, B. (2022). Variable selection with multiply-imputed datasets: choosing between stacked and grouped methods. Journal of Computational and Graphical Statistics, 31(4), 1063-1075. <doi:10.1080/10618600.2022.2035739>

Examples


library(miselect)
library(mice)

set.seed(48109)

# Using the mice defaults for sake of example only.
mids <- mice(miselect.df, m = 5, printFlag = FALSE)
dfs <- lapply(1:5, function(i) complete(mids, action = i))

# Generate list of imputed design matrices and imputed responses
x <- list()
y <- list()
for (i in 1:5) {
    x[[i]] <- as.matrix(dfs[[i]][, paste0("X", 1:20)])
    y[[i]] <- dfs[[i]]$Y
}

# Calculate observational weights
weights  <- 1 - rowMeans(is.na(miselect.df))
pf       <- rep(1, 20)
adWeight <- rep(1, 20)

# Since 'Y' is a binary variable, we use 'family = "binomial"'
fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial")

# By default 'coef' returns the betas for (lambda.min, alpha.min)
coef(fit)


# You can also cross validate over alpha

fit <- cv.saenet(x, y, pf, adWeight, weights, family = "binomial",
                 alpha = c(.5, 1))
# Get selected variables from the 1 standard error rule
coef(fit, lambda = fit$lambda.1se, alpha = fit$alpha.1se)



[Package miselect version 0.9.2]