smle_select {SMLE} | R Documentation |
Elaborative post-screening selection with SMLE
Description
The features retained after screening are still likely to contain some that are not related to the response. The function smle_select() is designed to further identify the relevant features using SMLE().
Given a response and a set of K features, this function first runs SMLE(fast = TRUE) to generate a series of sub-models with sparsity k varying from k_min to k_max. It then selects the best model from the series based on a selection criterion.
When the EBIC criterion is used, users can choose to repeat the selection with different values of the tuning parameter \gamma and conduct importance voting for each feature. When vote = TRUE, this function fits all the models with the \gamma values specified in gamma_seq, and features with frequency higher than vote_threshold are selected in ID_voted.
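The selection step described above can be illustrated with a minimal base R sketch. It does not call the package: the nested candidate models are built from a marginal-correlation ranking, which is only a stand-in (an assumption) for the sub-models that SMLE(fast = TRUE) would generate. The helper ebic() is hypothetical and follows the general form in Chen and Chen (2012), -2 logL + k log(n) + 2 \gamma log(choose(p, k)); the package's exact convention for counting parameters may differ.

```r
# Sketch only: choose a sparsity level k by EBIC over nested Gaussian models.
# The candidate models here come from a marginal-correlation ranking, which is
# an assumption standing in for the sub-models produced by SMLE(fast = TRUE).
set.seed(1)
n <- 100
p <- 30
X <- matrix(rnorm(n * p), n, p)
Y <- 2 * X[, 1] - 1.5 * X[, 2] + X[, 3] + rnorm(n)

# Hypothetical helper: EBIC of a fitted model using k of p candidate features
ebic <- function(fit, k, p, gamma) {
  -2 * as.numeric(logLik(fit)) + k * log(nobs(fit)) + 2 * gamma * lchoose(p, k)
}

# Rank features by absolute marginal correlation with the response
ranking <- order(abs(cor(X, Y)), decreasing = TRUE)

k_min <- 1
k_max <- 10
gamma_ebic <- 0.5

# Criterion value for each candidate sparsity k_min, ..., k_max
ebic_values <- sapply(k_min:k_max, function(k) {
  fit <- lm(Y ~ X[, ranking[1:k], drop = FALSE])
  ebic(fit, k = k, p = p, gamma = gamma_ebic)
})

# The selected model is the candidate with the smallest criterion value
k_best <- (k_min:k_max)[which.min(ebic_values)]
ID_selected <- sort(ranking[1:k_best])
ID_selected
```

Setting gamma_ebic to 0 in this sketch reduces the criterion to ordinary BIC, while larger values penalize larger models more heavily when p is large.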
Usage
smle_select(object, ...)
## S3 method for class 'sdata'
smle_select(
object,
k_min = 1,
k_max = NULL,
subset = NULL,
gamma_ebic = 0.5,
vote = FALSE,
keyset = NULL,
criterion = "ebic",
codingtype = c("DV", "standard", "all"),
gamma_seq = c(seq(0, 1, 0.2)),
vote_threshold = 0.6,
parallel = FALSE,
num_clusters = NULL,
...
)
## Default S3 method:
smle_select(
object = NULL,
Y = NULL,
X = NULL,
family = "gaussian",
keyset = NULL,
...
)
## S3 method for class 'smle'
smle_select(object, ...)
Arguments
object |
Object of class |
... |
Further arguments passed to or from other methods. |
k_min |
The lower bound of candidate model sparsity. Default is 1. |
k_max |
The upper bound of candidate model sparsity. Default is the number of columns in feature matrix. |
subset |
An index vector indicating which features (columns of the
feature matrix) are to be selected. Not applicable if a |
gamma_ebic |
The EBIC tuning parameter, in |
vote |
The logical flag for whether to perform the voting procedure. Only available when criterion = "ebic". |
keyset |
A numeric vector with column indices for the key features that do not participate in feature screening and are forced to remain in the model. See SMLE for details. |
criterion |
Selection criterion. One of " |
codingtype |
Coding types for categorical features; for more details see |
gamma_seq |
The sequence of values for |
vote_threshold |
A relative voting threshold in percentage. A feature is considered to be important when it receives votes passing the threshold. Default is 0.6. |
parallel |
A logical flag to use parallel computing to do voting selection.
Default is FALSE. |
num_clusters |
The number of compute clusters to use when
|
Y |
Input response vector (when |
X |
Input features matrix (when |
family |
Model assumption; see When input is a |
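How gamma_seq and vote_threshold interact can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: candidate models again come from a marginal-correlation ranking rather than from SMLE(), and the comparison against the threshold uses a strict "higher than", following the wording in the Description.

```r
# Sketch only: importance voting across several EBIC tuning parameters.
# Candidate models are nested by a marginal-correlation ranking (an assumption
# standing in for the SMLE sub-models).
set.seed(1)
n <- 100
p <- 30
X <- matrix(rnorm(n * p), n, p)
Y <- 2 * X[, 1] - 1.5 * X[, 2] + X[, 3] + rnorm(n)

gamma_seq <- seq(0, 1, 0.2)   # same default as in Usage
vote_threshold <- 0.6         # same default as in Usage
k_max <- 10

ranking <- order(abs(cor(X, Y)), decreasing = TRUE)

# -2 * log-likelihood of each nested candidate model, computed once
neg2loglik <- sapply(1:k_max, function(k) {
  -2 * as.numeric(logLik(lm(Y ~ X[, ranking[1:k], drop = FALSE])))
})

# One EBIC-based selection per value of gamma
selected_by_gamma <- lapply(gamma_seq, function(g) {
  ebic <- neg2loglik + (1:k_max) * log(n) + 2 * g * lchoose(p, 1:k_max)
  ranking[1:which.min(ebic)]
})

# Pool of all features selected at least once, and their voting frequency
ID_pool <- sort(unique(unlist(selected_by_gamma)))
vote_freq <- sapply(ID_pool, function(j) {
  mean(sapply(selected_by_gamma, function(ids) j %in% ids))
})

# Features whose frequency is higher than the threshold are voted in
ID_voted <- ID_pool[vote_freq > vote_threshold]
ID_voted
```

With six values in gamma_seq and a threshold of 0.6, a feature in this sketch needs to be selected in at least four of the six runs to be voted in.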
Details
This function accepts three types of input objects:
1) a 'smle' object, as the output from SMLE();
2) a 'sdata' object, as the output from Gen_Data();
3) a response vector and feature matrix supplied directly by the user.
Note that this function is mainly designed to conduct an elaborative selection after feature screening. We do not recommend using it directly for ultra-high-dimensional data without screening.
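The input types listed above might be used as in the following sketch, which relies only on the signatures shown in Usage. Two assumptions are made here: that extra arguments such as k_max are forwarded from the default method through "...", and that restricting the feature matrix to its first 20 columns (the hypothetical X_small) is an acceptable way to keep the direct call small. A 'sdata' object would be passed the same way as the 'smle' object, as the first argument.

```r
# Sketch only: calling smle_select() on a 'smle' object and on raw Y / X input.
library(SMLE)
set.seed(1)
Data <- Gen_Data(correlation = "MA", family = "gaussian")

# 1) 'smle' object: screen first, then select among the retained features
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
sel_from_smle <- smle_select(fit, criterion = "ebic")

# 3) response vector and feature matrix supplied directly (default method).
# X_small is a hypothetical low-dimensional subset used for illustration;
# k_max is assumed to be forwarded to the underlying method through "...".
X_small <- Data$X[, 1:20]
sel_from_xy <- smle_select(Y = Data$Y, X = X_small,
                           family = "gaussian", k_max = 10)

# Selected feature indices, as described under Value
sel_from_smle$ID_selected
sel_from_xy$ID_selected
```

The first form follows the recommended workflow, since the selection is run only on features that have already passed screening.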
Value
call |
The call that produced this object. |
ID_selected |
A list of selected features. |
coef_selected |
Fitted model coefficients. |
intercept |
Fitted model intercept. |
criterion_value |
Values of selection criterion for the candidate models with various sparsity. |
categorical |
A logical flag indicating whether the input feature matrix includes categorical features. |
ID_pool |
A vector containing all features selected during voting. |
ID_voted |
A vector containing the features selected when |
CI |
Indices of categorical features when |
X, Y, family, gamma_ebic, gamma_seq, criterion, vote, codingtype, and vote_threshold are returned as the arguments passed in the function call.
References
Chen, J. and Chen, Z. (2012). "Extended BIC for small-n-large-p sparse GLM." Statistica Sinica, 22(2), 555-574.
Examples
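set.seed(1)
# Simulate Gaussian data with moving-average ("MA") correlated features
Data <- Gen_Data(correlation = "MA", family = "gaussian")
# Screen down to 20 features
fit <- SMLE(Y = Data$Y, X = Data$X, k = 20, family = "gaussian")
# Select among the retained features using BIC
fit_bic <- smle_select(fit, criterion = "bic")
summary(fit_bic)
# Select using EBIC, with importance voting over gamma_seq
fit_ebic <- smle_select(fit, criterion = "ebic", vote = TRUE)
summary(fit_ebic)
plot(fit_ebic)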