SMLE {SMLE}

R Documentation

Joint feature screening via sparse maximum likelihood estimation for GLMs

Description

Given an n by 1 response Y and an n by p feature matrix X, the function uses SMLE to retain only a set of k < n features that appear most related to the response variable. It thus serves as a pre-processing step for a more elaborate analysis. SMLE naturally accounts for the joint effects between features, which makes the screening more reliable. The function uses the efficient iterative hard thresholding (IHT) algorithm, with the step parameter adaptively tuned for fast convergence. Users can choose to conduct a more elaborate selection after SMLE-screening; see smle_select() for details.

Usage

SMLE(formula = NULL, ...)

## Default S3 method:
SMLE(
  formula = NULL,
  X = NULL,
  Y = NULL,
  data = NULL,
  k = NULL,
  family = c("gaussian", "binomial", "poisson"),
  keyset = NULL,
  intercept = TRUE,
  categorical = TRUE,
  group = TRUE,
  codingtype = NULL,
  coef_initial = NULL,
  max_iter = 500,
  tol = 10^(-3),
  selection = F,
  standardize = TRUE,
  fast = FALSE,
  U = 1,
  U_rate = 0.5,
  penalize_mod = TRUE,
  ...
)

## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)

Arguments

`formula`	An object of class `'formula'` (or one that can be coerced to that class): a symbolic description of the model to be fitted. It should be `NULL` when `X` and `Y` are used.
`...`	Additional arguments to be passed to `smle_select()` if `selection = TRUE`. See `smle_select()` documentation for more details.
`X`	The `n` by `p` feature matrix `X` with each column denoting a feature (covariate) and each row denoting an observation vector. The input should be a `'matrix'` object for numerical data, and `'data.frame'` for categorical data (or a mixture of numerical and categorical data). The algorithm will treat covariates having class `'factor'` as categorical data and extend the data frame dimension by the dummy columns needed for coding the categorical features.
`Y`	The response vector `Y` of dimension `n` by `1`. Quantitative for `family = "gaussian"`, non-negative counts for `family = "poisson"`, binary (0-1) for `family = "binomial"`. Input `Y` should be `'numeric'`.
`data`	An optional data frame, list or environment (or object coercible by `as.data.frame()` to a `'data.frame'`) containing the features in the model. It is required if `'formula'` is used.
`k`	Total number of features (including `keyset`) to be retained after screening. Default is the largest integer not exceeding `0.5 log(n) n^{1/3}`.
`family`	Model assumption between `Y` and `X`; either a character string representing one of the built-in families, or else a glm() family object. The default model is Gaussian linear.
`keyset`	A vector of column indices for the key features, which do not participate in feature screening and are forced to remain in the model. The indices refer to columns of `data` if `'formula'` is used, or to columns of `X` if `X` and `Y` are provided. The class of `keyset` can be `'numeric'`, `'integer'` or `'character'`. Default is `NULL`.
`intercept`	A logical flag to indicate whether an intercept is used in the model. An intercept will not participate in screening.
`categorical`	A logical flag for whether the input feature matrix includes categorical features (either `'factor'` or `'character'`). `FALSE` treats all features as numerical and does not check whether there are categorical features; `TRUE` treats the data as having some categorical features, and the algorithm determines which columns contain them. If all features are known to be numerical, it will be faster to run SMLE with this argument set to `FALSE`. Default is `TRUE`.
`group`	Logical flag for whether to treat the dummy covariates of a categorical feature as a group. (Only for categorical data, see Details). Default is `TRUE`.
`codingtype`	Coding type for categorical features; default is `"DV"`. `codingtype = "all"` converts each level to a 0-1 vector. `codingtype = "DV"` conducts deviation coding for each level in comparison with the grand mean. `codingtype = "standard"` conducts standard dummy coding for each level in comparison with the reference level (first level).
`coef_initial`	A `p`-dimensional vector for the initial coefficient value of the IHT algorithm. The default is to use Lasso with the sparsity closest to `n-1`.
`max_iter`	Maximum number of iteration steps. Default is 500.
`tol`	A tolerance level to stop the iterations, when the squared sum of differences between two successive coefficient updates is below it. Default is `10^{-3}`.
`selection`	A logical flag to indicate whether an elaborate selection is to be conducted by `smle_select()` after screening. If `TRUE`, the function will return a `'selection'` object; see the `smle_select()` documentation. Default is `FALSE`.
`standardize`	A logical flag for feature standardization, prior to performing feature screening. The resulting coefficients are always returned on the original scale. If features are in the same units already, you might not wish to standardize. Default is `standardize = TRUE`.
`fast`	Set to `TRUE` to enable early stopping for SMLE-screening. This may improve screening efficiency at a small cost in accuracy. Default is `FALSE`; see Details.
`U`	A numerical multiplier for the initial step parameter in the IHT algorithm. Default is 1. For the binomial model, a larger initial value is recommended; a smaller one is recommended for the Poisson model.
`U_rate`	Rate at which the step parameter `1/u` is decreased in the IHT algorithm. Default is 0.5. See Details.
`penalize_mod`	A logical flag to indicate whether an adjustment is used in ranking groups of features. This argument is applicable only when `categorical = TRUE` with `group = TRUE`. When `penalize_mod = TRUE`, the `L_2` effect of a group with `J` members is divided by `\sqrt{J}`. Default is `TRUE`.
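
As a small illustration of the default for `k` described above, the rule can be written in base R. The helper name `default_k` is hypothetical and not part of the package, and the sketch assumes that "log" in the documented rule is the natural logarithm.

```r
# Hypothetical helper (not part of SMLE): default number of retained features,
# assuming "log" in the documented rule means the natural logarithm.
default_k <- function(n) {
  floor(0.5 * log(n) * n^(1/3))
}

default_k(200)  # 15, since 0.5 * log(200) * 200^(1/3) is about 15.49
```

For a sample of 200 observations, this reading of the rule would retain 15 features unless `k` is supplied explicitly.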

Details

With the input Y and X, SMLE() conducts joint feature screening by running the iterative hard thresholding (IHT) algorithm, where the default initial value is set to be the Lasso estimate with the sparsity closest to the sample size minus one.

In SMLE(), the initial value of the step size parameter 1/u is determined as follows. When coef_initial = 0, we set 1/u = U/\sqrt{p}. When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X corresponding to the non-zero positions of coef_initial and set 1/u = U/\sqrt{p}||X_0||^2_{\infty}. The value of the step size is recursively decreased by U_rate to guarantee the likelihood increment. This strategy is called u-search.
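
The idea of shrinking the step size until the likelihood increases can be sketched with a generic IHT update for a Gaussian linear model. This is an illustration under stated assumptions, not the package's internal code: the function names `hard_threshold`, `loglik_gaussian` and `iht_step` are hypothetical, the log-likelihood is taken up to constants, and the sketch starts from `coef_initial = 0` with `U = 1`, restarting the search from `U / sqrt(p)` at every iteration.

```r
# Keep the k largest coefficients in absolute value and zero out the rest.
hard_threshold <- function(beta, k) {
  keep <- order(abs(beta), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(beta))
  out[keep] <- beta[keep]
  out
}

# Gaussian log-likelihood up to constants: minus half the residual sum of squares.
loglik_gaussian <- function(X, Y, beta) {
  -0.5 * sum((Y - X %*% beta)^2)
}

# One IHT update; the step is multiplied by U_rate until the likelihood increases.
iht_step <- function(X, Y, beta, k, step, U_rate = 0.5, max_tries = 50) {
  grad <- drop(crossprod(X, Y - X %*% beta))  # score of the Gaussian model
  ll_old <- loglik_gaussian(X, Y, beta)
  for (tries in seq_len(max_tries)) {
    beta_new <- hard_threshold(beta + step * grad, k)
    if (loglik_gaussian(X, Y, beta_new) > ll_old) {
      return(list(beta = beta_new, step = step, tries = tries))
    }
    step <- step * U_rate  # shrink the step and try again
  }
  list(beta = beta, step = step, tries = max_tries)  # no improving step found
}

set.seed(1)
n <- 200; p <- 20; k <- 3
X <- matrix(rnorm(n * p), n, p)
beta_true <- c(3, -3, 3, rep(0, p - 3))
Y <- drop(X %*% beta_true + rnorm(n))

beta <- numeric(p)  # corresponds to coef_initial = 0
for (iter in 1:100) {
  res <- iht_step(X, Y, beta, k, step = 1 / sqrt(p))  # U / sqrt(p) with U = 1
  beta <- res$beta
}
which(beta != 0)  # indices of the retained features
```

The `tries` element plays a role similar to the `Usearch` value described under Value: it counts how many step sizes were attempted in one iteration.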

SMLE() terminates the IHT iterations when either the tolerance tol is met or max_iter is reached. When fast = TRUE, the algorithm also stops when the non-zero members of the coefficient estimates remain the same for 10 successive iterations, when the log-likelihood difference between successive coefficient estimates is less than 0.01 times the log-likelihood increase of the first step, or when the tolerance tol\sqrt{k} is met.
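
The stopping rules above can be collected into a single check, sketched below. The function `should_stop` and the arguments `support_unchanged`, `ll_diff` and `ll_first` are hypothetical names introduced for illustration; the sketch also assumes that the change between updates is measured as the sum of squared coefficient differences.

```r
# Hypothetical stopping check mirroring the documented criteria.
# support_unchanged: number of successive iterations with the same non-zero set.
# ll_diff: log-likelihood difference between the last two estimates.
# ll_first: log-likelihood increase achieved by the first step.
should_stop <- function(beta_old, beta_new, iter, k,
                        tol = 1e-3, max_iter = 500, fast = FALSE,
                        support_unchanged = 0, ll_diff = Inf, ll_first = Inf) {
  change <- sum((beta_new - beta_old)^2)  # assumed form of the tol criterion
  if (change < tol || iter >= max_iter) return(TRUE)
  if (fast) {
    if (support_unchanged >= 10) return(TRUE)
    if (ll_diff < 0.01 * ll_first) return(TRUE)
    if (change < tol * sqrt(k)) return(TRUE)
  }
  FALSE
}

should_stop(c(1, 0), c(1.2, 0), iter = 5, k = 1)                # FALSE
should_stop(c(1, 0), c(1.2, 0), iter = 5, k = 1, fast = TRUE,
            support_unchanged = 10)                             # TRUE
```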

In SMLE(), categorical features are coded by dummy covariates with the method specified in codingtype. Users can use group to specify whether to treat those dummy covariates as a single group feature or as individual features. When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy covariates is computed by

\beta_i = \sqrt{(\beta_1)^2+...+(\beta_J)^2}/\sqrt J,

which will be treated as a single feature in IHT iterations. When group = FALSE, a group of J dummy covariates will be treated as J individual features in the IHT iterations; in this case, a categorical feature is retained after screening when at least one of the corresponding dummy covariates is retained.
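
The coding schemes and the group effect can be illustrated in base R. The mapping of `"all"`, `"standard"` and `"DV"` to `model.matrix()` calls below is an assumption made for illustration (the package may construct its dummy columns differently), and `group_effect` is a hypothetical helper that treats `penalize_mod = FALSE` as the plain L_2 norm.

```r
# A 3-level factor coded three ways with base R.
f <- factor(c("a", "b", "c", "a"))

# "all": one 0-1 column per level.
coding_all <- model.matrix(~ f - 1)

# "standard": dummy coding against the first level as reference.
coding_standard <- model.matrix(
  ~ f, contrasts.arg = list(f = "contr.treatment")
)[, -1, drop = FALSE]

# "DV": deviation coding, each level compared with the grand mean.
coding_dv <- model.matrix(
  ~ f, contrasts.arg = list(f = "contr.sum")
)[, -1, drop = FALSE]

# Hypothetical helper: effect of a group of J dummy coefficients.
group_effect <- function(beta_group, penalize_mod = TRUE) {
  l2 <- sqrt(sum(beta_group^2))
  if (penalize_mod) l2 / sqrt(length(beta_group)) else l2
}

group_effect(c(3, 4))                        # 5 / sqrt(2)
group_effect(c(3, 4), penalize_mod = FALSE)  # 5
```

Dividing by \sqrt{J} keeps a factor with many levels from outranking other features simply because it contributes more dummy columns.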

Since feature screening is usually a preprocessing step, users may wish to conduct a more elaborate feature selection after screening. This can be done by setting selection = TRUE in SMLE() or by applying any existing selection method to the output of SMLE().
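
As one possible way to apply an existing selection method to screened features, the sketch below runs backward stepwise selection with a BIC penalty on a screened subset, using only base R. The vector `ID_retained` is a hard-coded stand-in for the `ID_retained` element of an SMLE() fit, and the simulated data and column names are assumptions for illustration.

```r
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("V", seq_len(p))))
Y <- drop(2 * X[, 1] - 3 * X[, 2] + rnorm(n))

# Stand-in for fit$ID_retained; in practice this would come from SMLE().
ID_retained <- c(1, 2, 7, 15, 30)

# Fit a linear model on the screened features only.
screened <- data.frame(Y = Y, X[, ID_retained])
full <- lm(Y ~ ., data = screened)

# Backward elimination; k = log(n) gives the BIC penalty.
selected <- step(full, direction = "backward", k = log(n), trace = 0)
names(coef(selected))
```

Because screening has already reduced the problem to a handful of columns, classical tools such as `step()` become practical even when the original p exceeds n.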

Value

`call`	The call that produced this object.
`ID_retained`	A vector indicating the features retained after SMLE-screening. The output includes both features retained by `SMLE()` and the features specified in `keyset`.
`coef_retained`	The vector of coefficients estimated by IHT for the retained features. When the retained set contains a categorical feature, the value returns a group effect if `group = TRUE`, or returns the strongest dummy covariate effect if `group = FALSE`.
`path_retained`	IHT iteration path with columns recording the coefficient updates.
`num_retained`	Number of retained features after screening.
`intercept`	The estimated intercept value by IHT, if `intercept = TRUE`.
`steps`	Number of IHT iterations.
`likelihood_iter`	A list of log-likelihood updates over the IHT iterations.
`Usearch`	A vector giving the number of attempts to find a proper `1/u` at each iteration step.
`modified_data`	A list containing data objects generated by SMLE. `CM`: Design matrix of class `'matrix'` for numeric features (or `'data.frame'` with categorical features). `DM`: A matrix with dummy variable features added (only if there are categorical features). `dum_col`: Number of levels for all categorical features. `CI`: Indices of categorical features in `CM`. `DFI`: Indices of categorical features in `IM`.
`iteration_data`	A list containing data objects that track the coefficients over iterations. `IM`: Iteration path matrix with columns recording IHT coefficient updates. `beta0`: Initial value of the regression coefficients for IHT. `feature_name`: A list containing the names of selected features. `FD`: A matrix that contains feature indices retained at each iteration step.

X, Y, data, family, categorical and codingtype are returned as passed in the function call.

References

UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.

Examples


# Example 1: screening on simulated Gaussian data.
library(SMLE)
set.seed(1)
Data <- Gen_Data(n = 200, p = 5000, family = "gaussian", correlation = "ID")
# Retain k = 9 features out of p = 5000.
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained  # Sure screening check.
plot(fit)

# Example 2: simulated data with categorical features, split into train and test.
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5,
                      pos_ctgidx = c(1, 3, 5, 7, 9), effect_truecoef = c(1, 2, 3, -4, -5),
                      pos_truecoef = c(1, 3, 5, 7, 8), level_ctgidx = c(3, 3, 3, 4, 5))
# First 400 rows for training, last 20 rows for testing.
train_X <- Data_sim2$X[1:400, ]; test_X <- Data_sim2$X[401:420, ]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
# Compare predictions on the held-out rows with the observed responses.
predict(fit, newdata = test_X)
test_Y

# Example 3: formula interface on the attitude data with added noise features.
library(datasets)
data("attitude")
set.seed(1)
# Append 100 noise columns to the 30 observations in attitude.
noise <- matrix(rnorm(30 * 100, mean = mean(attitude$rating), sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~ ., data = df)
fit

[Package SMLE version 2.1-1 Index]