SMLE {SMLE} | R Documentation |
Joint feature screening via sparse maximum likelihood estimation for GLMs
Description
Given an n by 1 response Y and an n by p feature matrix X, the function uses
SMLE to retain only a set of k < n features that appear to be most related to
the response variable. It thus serves as a pre-processing step for a more
elaborate analysis. In SMLE, the joint effects between features are naturally
accounted for, which makes the screening more reliable. The function uses the
efficient iterative hard thresholding (IHT) algorithm, with the step parameter
adaptively tuned for fast convergence. Users can choose to conduct a further
elaborate selection after SMLE-screening; see smle_select() for more details.
Usage
SMLE(formula = NULL, ...)
## Default S3 method:
SMLE(
formula = NULL,
X = NULL,
Y = NULL,
data = NULL,
k = NULL,
family = c("gaussian", "binomial", "poisson"),
keyset = NULL,
intercept = TRUE,
categorical = TRUE,
group = TRUE,
codingtype = NULL,
coef_initial = NULL,
max_iter = 500,
tol = 10^(-3),
selection = FALSE,
standardize = TRUE,
fast = FALSE,
U = 1,
U_rate = 0.5,
penalize_mod = TRUE,
...
)
## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)
Arguments
formula |
An object of class |
... |
Additional arguments to be passed to |
X |
The |
Y |
The response vector |
data |
An optional data frame, list or environment (or object coercible by |
k |
Total number of features (including |
family |
Model assumption between |
keyset |
A numeric vector with column indices for the key features that
do not participate in feature screening and are forced to remain in the model.
The column indices for the key features should be from |
intercept |
A logical flag indicating whether an intercept is used in the model. The intercept does not participate in screening. |
categorical |
A logical flag for whether the input feature matrix includes
categorical features (either |
group |
Logical flag for whether to treat the dummy covariates of a
categorical feature as a group. (Only for categorical data, see Details).
Default is |
codingtype |
Coding types for categorical features; default is |
coef_initial |
A |
max_iter |
Maximum number of iteration steps. Default is 500. |
tol |
A tolerance level to stop the iterations, when the squared sum of
differences between two successive coefficient updates is below it.
Default is |
selection |
A logical flag to indicate whether an elaborate selection
is to be conducted by |
standardize |
A logical flag for feature standardization, prior to
performing feature screening. The resulting coefficients are
always returned on the original scale.
If features are in the same units already, you might not wish to
standardize. Default is |
fast |
Set to |
U |
A numerical multiplier for the initial tuning step parameter in the IHT algorithm. Default is 1. For the binomial model, a larger initial value is recommended; a smaller one is recommended for the Poisson model. |
U_rate |
Decreasing rate in tuning step parameter |
penalize_mod |
A logical flag to indicate whether adjustment is used in
ranking groups of features. This argument is applicable only when
|
Details
With the input Y and X, SMLE() conducts joint feature screening by running the
iterative hard thresholding (IHT) algorithm, where the default initial value is
set to be the Lasso estimate with the sparsity closest to the sample size minus one.
In SMLE(), the initial value for the step size parameter 1/u is determined as
follows. When coef_initial = 0, we set 1/u = U / \sqrt{p}.
When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X
corresponding to the non-zero positions of coef_initial and set
1/u = U/\sqrt{p}||X||^2_{\infty}
and recursively decrease the value of the step size by U_rate to guarantee the
likelihood increment. This strategy is called u-search.
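The u-search idea can be sketched in base R for the Gaussian case. This is only an illustration under stated assumptions, not the package's internal code: the helper names hard_threshold(), gaussian_loglik() and iht_step() are hypothetical, unit error variance is assumed, the start is coef_initial = 0 with U = 1, and the search restarts from the initial step at every iteration.

```r
# Illustrative sketch only; helper names are hypothetical, not part of SMLE.
hard_threshold <- function(beta, k) {
  # Keep the k largest entries in absolute value and set the rest to zero.
  keep <- order(abs(beta), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(beta))
  out[keep] <- beta[keep]
  out
}

gaussian_loglik <- function(X, Y, beta) {
  # Gaussian log-likelihood up to an additive constant (unit variance assumed).
  -0.5 * sum((Y - X %*% beta)^2)
}

iht_step <- function(X, Y, beta, k, step, U_rate = 0.5, max_tries = 50) {
  ll_old <- gaussian_loglik(X, Y, beta)
  score <- as.vector(crossprod(X, Y - X %*% beta))  # gradient of the log-likelihood
  accepted <- FALSE
  for (tries in seq_len(max_tries)) {
    beta_new <- hard_threshold(beta + step * score, k)
    if (gaussian_loglik(X, Y, beta_new) >= ll_old) {
      accepted <- TRUE
      break
    }
    step <- step * U_rate  # shrink the step and try again
  }
  if (!accepted) beta_new <- beta  # no acceptable step found: keep the old value
  list(beta = beta_new, step = step, tries = tries)
}

set.seed(1)
n <- 50; p <- 200; k <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- as.vector(X[, 1:3] %*% c(3, -2, 2) + rnorm(n))

beta <- numeric(p)        # zero start, so the initial step is U / sqrt(p)
step0 <- 1 / sqrt(p)      # U = 1
ll_path <- gaussian_loglik(X, Y, beta)
for (iter in 1:20) {
  upd <- iht_step(X, Y, beta, k, step0)
  beta <- upd$beta
  ll_path <- c(ll_path, gaussian_loglik(X, Y, beta))
}
which(beta != 0)          # indices retained by this toy run
```

Because a candidate update is accepted only when the log-likelihood does not decrease, ll_path is non-decreasing by construction, which is the property the step-size search is meant to guarantee.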
SMLE() terminates IHT iterations when either tol or max_iter is satisfied.
When fast = TRUE, the algorithm also stops when the non-zero members of the
coefficient estimates remain the same for 10 successive iterations, or the
log-likelihood difference between coefficient estimates is less than 0.01 times
the log-likelihood increase of the first step, or tol\sqrt{k} is satisfied.
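The two default stopping rules can be written as a small check. This is a sketch that assumes the tol criterion is the squared sum of differences between successive coefficient updates, as described for the tol argument; the function name iht_should_stop() is hypothetical, and the additional fast = TRUE rules are not covered.

```r
# Illustrative sketch only; iht_should_stop() is a hypothetical helper.
iht_should_stop <- function(beta_old, beta_new, iter,
                            tol = 10^(-3), max_iter = 500) {
  converged <- sum((beta_new - beta_old)^2) < tol  # tol criterion
  exhausted <- iter >= max_iter                    # max_iter criterion
  converged || exhausted
}

# Squared change is 2e-4, below tol: stop.
iht_should_stop(c(1, 0, 2), c(1.01, 0, 2.01), iter = 3)
# Squared change is 0.25 and iter is below max_iter: continue.
iht_should_stop(c(1, 0, 2), c(1.5, 0, 2), iter = 3)
```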
In SMLE(), categorical features are coded by dummy covariates with the method
specified in codingtype. Users can use group to specify whether to treat those
dummy covariates as a single group feature or as individual features.
When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy
covariates is computed by
\beta_i = \sqrt{(\beta_1)^2 + ... + (\beta_J)^2}/\sqrt{J},
which will be treated as a single feature in IHT iterations. When group = FALSE,
a group of J dummy covariates will be treated as J individual features in the
IHT iterations; in this case, a categorical feature is retained after screening
when at least one of the corresponding dummy covariates is retained.
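As a worked instance of the group effect formula above, the following sketch computes the effect for one group of dummy coefficients. The function name group_effect() is hypothetical and only mirrors the formula; it is not a function exported by the package.

```r
# Illustrative sketch only; group_effect() is a hypothetical helper.
group_effect <- function(beta_group) {
  J <- length(beta_group)             # number of dummy covariates in the group
  sqrt(sum(beta_group^2)) / sqrt(J)   # Euclidean norm scaled by sqrt(J)
}

# A group with J = 2 dummy coefficients 3 and 4 has norm 5,
# so the group effect is 5 / sqrt(2).
group_effect(c(3, 4))
```

Dividing by \sqrt{J} keeps a group with many levels from outranking other features merely because it contributes more coefficients to the norm.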
Since feature screening is usually a preprocessing step, users may wish to
conduct a further, more elaborate feature selection after screening. This can
be done by setting selection = TRUE in SMLE() or by applying any existing
selection method to the output of SMLE().
Value
call |
The call that produced this object. |
ID_retained |
A vector indicating the features retained after SMLE-screening.
The output includes both features retained by |
coef_retained |
The vector of coefficients estimated by IHT for the retained features. When the
retained set contains a categorical feature, the value returns a group effect if
|
path_retained |
IHT iteration path with columns recording the coefficient updates. |
num_retained |
Number of retained features after screening. |
intercept |
The estimated intercept value by IHT, if |
steps |
Number of IHT iterations. |
likelihood_iter |
A list of log-likelihood updates over the IHT iterations. |
Usearch |
A vector giving the number of attempts to find a proper |
modified_data |
A list containing data objects generated by SMLE.
|
iteration_data |
A list containing data objects that track the coefficients over iterations.
|
X, Y, data, family, categorical and codingtype are returned as the arguments
passed in the function call.
References
UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.
Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.
Examples
# Example 1:
set.seed(1)
Data <- Gen_Data(n = 200, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained # Sure screening check.
plot(fit)
# Example 2:
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5,
                      pos_ctgidx = c(1, 3, 5, 7, 9), effect_truecoef = c(1, 2, 3, -4, -5),
                      pos_truecoef = c(1, 3, 5, 7, 8), level_ctgidx = c(3, 3, 3, 4, 5))
train_X <- Data_sim2$X[1:400,]; test_X <- Data_sim2$X[401:420,]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
predict(fit, newdata = test_X)
test_Y
# Example 3:
library(datasets)
data("attitude")
set.seed(1)
noise <- matrix(rnorm(30*100, mean = mean(attitude$rating) , sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~ ., data = df)
fit