R: Elaborative post-screening selection with SMLE

smle_select {SMLE}

R Documentation

Elaborative post-screening selection with SMLE

Description

The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE(). Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.
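
The idea of picking the best sub-model along a sparsity path can be sketched in base R. This is only an illustration under stated assumptions: the feature ranking by marginal correlation is a stand-in and is not the SMLE algorithm, the data are simulated, and `BIC()` on `lm()` fits plays the role of the selection criterion.

```r
set.seed(1)
n <- 100
K <- 8
X <- matrix(rnorm(n * K), n, K)
Y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)  # only features 1 and 2 matter

# Stand-in ranking (an assumption, NOT what SMLE() does):
# order features by absolute marginal correlation with the response.
ranking <- order(abs(cor(X, Y)), decreasing = TRUE)

# Candidate sparsity levels, mirroring k_min and k_max.
k_min <- 1
k_max <- 5
ks <- k_min:k_max

# One criterion value per candidate sub-model.
bic_values <- sapply(ks, function(k) {
  BIC(lm(Y ~ X[, ranking[seq_len(k)], drop = FALSE]))
})

# The selected model is the one minimising the criterion.
best_k <- ks[which.min(bic_values)]
ID_selected <- sort(ranking[seq_len(best_k)])
```

Here `bic_values` corresponds loosely to `criterion_value` in the returned object, and `ID_selected` to the selected feature indices.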

When the EBIC criterion is used, users can repeat the selection with different values of the tuning parameter \gamma and conduct importance voting for each feature. When vote = TRUE, this function fits a model for each value of \gamma in gamma_seq, and the features whose selection frequency is higher than vote_threshold are returned in ID_voted.
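
The voting step can be illustrated with a small base R sketch. The per-\gamma selections below are hypothetical values made up for illustration, not output from `smle_select()`, and the strict inequality follows the "higher than" wording above; the package internals may differ.

```r
gamma_seq <- seq(0, 1, 0.2)   # six candidate values of gamma
vote_threshold <- 0.6

# Hypothetical selected feature indices, one vector per value in gamma_seq.
# Larger gamma penalises model size more, so fewer features survive.
selected_by_gamma <- list(
  c(3, 7, 12, 15),
  c(3, 7, 12),
  c(3, 7, 12),
  c(3, 7),
  c(3, 7),
  c(3)
)

# Pool of all features selected at least once.
ID_pool <- sort(unique(unlist(selected_by_gamma)))

# Share of gamma values for which each pooled feature was selected.
vote_freq <- sapply(ID_pool, function(j) {
  mean(sapply(selected_by_gamma, function(s) j %in% s))
})

# Keep features whose frequency is higher than the threshold.
ID_voted <- ID_pool[vote_freq > vote_threshold]
```

In this made-up case features 3 and 7 are selected for 6/6 and 5/6 of the \gamma values and pass the 0.6 threshold, while feature 12 (3/6) and feature 15 (1/6) do not.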

Usage

smle_select(object, ...)

## S3 method for class 'sdata'
smle_select(
  object,
  k_min = 1,
  k_max = NULL,
  subset = NULL,
  gamma_ebic = 0.5,
  vote = FALSE,
  keyset = NULL,
  criterion = "ebic",
  codingtype = c("DV", "standard", "all"),
  gamma_seq = c(seq(0, 1, 0.2)),
  vote_threshold = 0.6,
  parallel = FALSE,
  num_clusters = NULL,
  ...
)

## Default S3 method:
smle_select(
  object = NULL,
  Y = NULL,
  X = NULL,
  family = "gaussian",
  keyset = NULL,
  ...
)

## S3 method for class 'smle'
smle_select(object, ...)

Arguments

`object`	Object of class `'smle'` or `'sdata'`. Users can also input a response vector and a feature matrix.
`...`	Further arguments passed to or from other methods.
`k_min`	The lower bound of candidate model sparsity. Default is 1.
`k_max`	The upper bound of candidate model sparsity. Default is the number of columns in the feature matrix.
`subset`	An index vector indicating which features (columns of the feature matrix) are to be selected. Not applicable if a `'smle'` object is the input.
`gamma_ebic`	The EBIC tuning parameter, in `[0, 1]`. Default is 0.5.
`vote`	A logical flag indicating whether to perform the voting procedure. Only available when `criterion = "ebic"`.
`keyset`	A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See `SMLE()` for details.
`criterion`	Selection criterion. One of "`ebic`", "`bic`", "`aic`". Default is "`ebic`".
`codingtype`	Coding types for categorical features; for more details see `SMLE()` documentation.
`gamma_seq`	The sequence of values for `gamma_ebic` when `vote = TRUE`.
`vote_threshold`	A relative voting threshold, given as a proportion of the values in `gamma_seq`. A feature is considered important when the share of votes it receives passes the threshold. Default is 0.6.
`parallel`	A logical flag to use parallel computing to do voting selection. Default is `FALSE`. See Details.
`num_clusters`	The number of compute clusters to use when `parallel = TRUE`. The default is twice the number of cores detected.
`Y`	Input response vector (when `object = NULL`).
`X`	Input feature matrix (when `object = NULL`).
`family`	Model assumption; see `SMLE()` documentation. Default is Gaussian linear. When input is a `'smle'` or `'sdata'` object, the same model will be used in the selection.
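
For reference, the role of `gamma_ebic` can be read off the usual definition of the extended BIC from Chen and Chen. The notation is an assumption introduced here: s is a candidate sub-model with |s| features, n the sample size, p the number of candidate features, and \ell the log-likelihood at the fitted coefficients. The package's exact computation may differ, for example in how the binomial term is evaluated.

```latex
\mathrm{EBIC}_{\gamma}(s)
  = -2\,\ell(\hat{\theta}_s) + |s| \log n + 2\gamma \log \binom{p}{|s|},
  \qquad 0 \le \gamma \le 1 .
```

Setting \gamma = 0 recovers the ordinary BIC, and larger values of \gamma penalise model size more heavily.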

Details

This function accepts three types of input: 1) a 'smle' object, as output by SMLE(); 2) a 'sdata' object, as output by Gen_Data(); 3) a response vector and a feature matrix supplied directly by the user.

Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.
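
The three input types might be used as in the sketch below. It relies only on calls that appear in the Usage and Examples sections; restricting the raw input to the first 20 columns and setting `k_max = 10` are arbitrary choices made here to keep the search small, not recommendations from the package.

```r
library(SMLE)

set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")

# 1) 'smle' object: screen first, then refine the retained features.
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
sel_smle <- smle_select(fit)

# 2) 'sdata' object: pass the simulated data set directly.
#    k_max bounds the candidate sparsity (arbitrary value for this sketch).
sel_sdata <- smle_select(Data, k_max = 10)

# 3) Response vector and feature matrix supplied by the user.
#    The first 20 columns are an arbitrary subset for illustration.
X_small <- Data$X[, 1:20]
sel_raw <- smle_select(Y = Data$Y, X = X_small, family = "gaussian",
                       k_max = 10)

# Each result stores the selected feature indices in ID_selected.
sel_smle$ID_selected
```

Route 1 matches the workflow this function is mainly designed for, since the selection then runs on an already screened feature set.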

Value

`call`	The call that produced this object.
`ID_selected`	A list of selected features.
`coef_selected`	Fitted model coefficients.
`intercept`	Fitted model intercept.
`criterion_value`	Values of selection criterion for the candidate models with various sparsity.
`categorical`	A logical flag indicating whether the input feature matrix includes categorical features.
`ID_pool`	A vector containing all features selected during voting.
`ID_voted`	A vector containing the features selected when `vote = TRUE`.
`CI`	Indices of categorical features when `categorical = TRUE`.

X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype, and vote_threshold are returned as the arguments passed in the function call.

References

Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-p sparse GLM." Statistica Sinica, 22(2), 555-574.

Examples


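set.seed(1)
# Simulate Gaussian data with "MA" feature correlation
Data <- Gen_Data(correlation = "MA", family = "gaussian")
# Screening step: retain k = 20 features
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")

# Post-screening selection with BIC
fit_bic <- smle_select(fit, criterion = "bic")
summary(fit_bic)

# Post-screening selection with EBIC and importance voting over gamma_seq
fit_ebic <- smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)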

[Package SMLE version 2.1-1 Index]