SMLE {SMLE} R Documentation

Joint feature screening via sparse maximum likelihood estimation for GLMs

Description

Given an n by 1 response Y and an n by p feature matrix X, the function uses SMLE to retain a set of k < n features that appear most related to the response variable. It thus serves as a pre-processing step for a more elaborate analysis. SMLE naturally accounts for the joint effects between features, which makes the screening more reliable. The function uses the efficient iterative hard thresholding (IHT) algorithm, with the step parameter adaptively tuned for fast convergence. Users can choose to conduct a further, more elaborate selection after SMLE-screening; see smle_select() for more details.

Usage

SMLE(formula = NULL, ...)

## Default S3 method:
SMLE(
  formula = NULL,
  X = NULL,
  Y = NULL,
  data = NULL,
  k = NULL,
  family = c("gaussian", "binomial", "poisson"),
  keyset = NULL,
  intercept = TRUE,
  categorical = TRUE,
  group = TRUE,
  codingtype = NULL,
  coef_initial = NULL,
  max_iter = 500,
  tol = 10^(-3),
  selection = FALSE,
  standardize = TRUE,
  fast = FALSE,
  U = 1,
  U_rate = 0.5,
  penalize_mod = TRUE,
  ...
)

## S3 method for class 'formula'
SMLE(formula, data, k = NULL, keyset = NULL, categorical = NULL, ...)

Arguments

formula

An object of class 'formula' (or one that can be coerced to that class): a symbolic description of the model to be fitted. It should be NULL when X and Y are used.

...

Additional arguments to be passed to smle_select() if selection = TRUE. See smle_select() documentation for more details.

X

The n by p feature matrix X with each column denoting a feature (covariate) and each row denoting an observation vector. The input should be a 'matrix' object for numerical data, and 'data.frame' for categorical data (or a mixture of numerical and categorical data). The algorithm will treat covariates having class 'factor' as categorical data and extend the data frame dimension by the dummy columns needed for coding the categorical features.

Y

The response vector Y of dimension n by 1. Quantitative for family = "gaussian", non-negative counts for family = "poisson", binary (0-1) for family = "binomial". Input Y should be 'numeric'.

data

An optional data frame, list or environment (or object coercible by as.data.frame() to a 'data.frame') containing the features in the model. It is required if 'formula' is used.

k

Total number of features (including keyset) to be retained after screening. Default is the largest integer not exceeding 0.5 log(n) n^{1/3}.

family

Model assumption between Y and X; either a character string representing one of the built-in families, or else a glm() family object. The default model is Gaussian linear.

keyset

A vector of column indices for the key features that do not participate in feature screening and are forced to remain in the model. The column indices for the key features should refer to data if 'formula' is used, or to X if X and Y are provided. The class of keyset can be 'numeric', 'integer' or 'character'. Default is NULL.

intercept

A logical flag to indicate whether an intercept is to be used in the model. An intercept will not participate in screening.

categorical

A logical flag for whether the input feature matrix includes categorical features (either 'factor' or 'character'). FALSE treats all features as numerical and does not check whether there are categorical features; TRUE treats the data as having some categorical features, and the algorithm determines which columns contain them. If all features are known to be numerical, it will be faster to run SMLE with this argument set to FALSE. Default is TRUE.

group

Logical flag for whether to treat the dummy covariates of a categorical feature as a group. (Only for categorical data, see Details). Default is TRUE.

codingtype

Coding types for categorical features; default is "DV". codingtype = "all" converts each level to a 0-1 vector. codingtype = "DV" conducts deviation coding for each level in comparison with the grand mean. codingtype = "standard" conducts standard dummy coding for each level in comparison with the reference level (first level).

coef_initial

A p-dimensional vector for the initial coefficient value of the IHT algorithm. The default is to use Lasso with the sparsity closest to n-1.

max_iter

Maximum number of iteration steps. Default is 500.

tol

A tolerance level to stop the iterations, when the squared sum of differences between two successive coefficient updates is below it. Default is 10^{-3}.

selection

A logical flag to indicate whether an elaborate selection is to be conducted by smle_select() after screening. If TRUE, the function will return a 'selection' object; see the smle_select() documentation. Default is FALSE.

standardize

A logical flag for feature standardization, prior to performing feature screening. The resulting coefficients are always returned on the original scale. If features are in the same units already, you might not wish to standardize. Default is standardize = TRUE.

fast

Set to TRUE to enable early stopping for SMLE-screening. It may improve screening efficiency at a small cost in accuracy. Default is FALSE; see Details.

U

A numerical multiplier of the initial tuning step parameter in the IHT algorithm. Default is 1. For the binomial model, a larger initial value is recommended; a smaller one is recommended for the Poisson model.

U_rate

Decreasing rate of the tuning step parameter 1/u in the IHT algorithm. Default is 0.5. See Details.

penalize_mod

A logical flag to indicate whether an adjustment is used in ranking groups of features. This argument is applicable only when categorical = TRUE with group = TRUE. When penalize_mod = TRUE, the L_2 effect of a group with J members is divided by \sqrt J. Default is TRUE.
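The default for k described above can be reproduced with a one-line calculation. The sketch below assumes that log denotes the natural logarithm; default_k() is a hypothetical helper written for illustration and is not a function exported by the package.

```r
# Default screening size as documented for k:
# the largest integer not exceeding 0.5 * log(n) * n^(1/3).
# default_k() is a hypothetical helper, and the natural log is an assumption.
default_k <- function(n) {
  floor(0.5 * log(n) * n^(1/3))
}

default_k(200)  # 15
```

Passing k explicitly, as in the Examples section, overrides this default.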

Details

With the input Y and X, SMLE() conducts joint feature screening by running the iterative hard thresholding (IHT) algorithm, where the default initial value is the Lasso estimate with the sparsity closest to the sample size minus one.

In SMLE(), the initial value of the step size parameter 1/u is determined as follows. When coef_initial = 0, we set 1/u = U / \sqrt{p}. When coef_initial != 0, we generate a sub-matrix X_0 using the columns of X corresponding to the non-zero positions of coef_initial and set 1/u = U/\sqrt{p}||X||^2_{\infty}. The step size is then recursively decreased by the factor U_rate until the likelihood increases. This strategy is called u-search.
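To make the u-search idea concrete, here is a minimal sketch of a single IHT update with a backtracking step size, written for the Gaussian model only. It is not the package's implementation: hard_threshold(), loglik_gaussian() and iht_step() are hypothetical helpers, and the cap max_tries is an assumption added so that the loop always terminates.

```r
# Keep the k largest entries (in absolute value) of a vector, zero the rest.
hard_threshold <- function(b, k) {
  keep <- order(abs(b), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(b))
  out[keep] <- b[keep]
  out
}

# Gaussian log-likelihood up to constants: minus half the residual sum of squares.
loglik_gaussian <- function(X, y, b) -0.5 * sum((y - X %*% b)^2)

# One IHT update: shrink the step size 1/u by U_rate until the
# log-likelihood of the thresholded candidate does not decrease.
iht_step <- function(X, y, beta, k, step, U_rate = 0.5, max_tries = 50) {
  grad <- drop(crossprod(X, y - X %*% beta))  # score of the Gaussian model
  ll_old <- loglik_gaussian(X, y, beta)
  for (tries in seq_len(max_tries)) {
    cand <- hard_threshold(beta + step * grad, k)
    if (loglik_gaussian(X, y, cand) >= ll_old) {
      return(list(beta = cand, step = step, tries = tries))
    }
    step <- step * U_rate                     # shrink 1/u and try again
  }
  list(beta = beta, step = step, tries = max_tries)  # no improving step found
}

set.seed(1)
n <- 100; p <- 20; k <- 3
X <- scale(matrix(rnorm(n * p), n, p))
y <- drop(X[, 1:3] %*% c(2, -3, 1.5) + rnorm(n))
beta0 <- numeric(p)
step0 <- 1 / sqrt(p)  # U / sqrt(p) with U = 1, the coef_initial = 0 case
res <- iht_step(X, y, beta0, k, step0)
res$tries             # number of step sizes tried in this update
```

The tries element plays a role similar to the Usearch component in the returned object, which records the number of attempts at each iteration.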

SMLE() terminates the IHT iterations when either the tol criterion is met or max_iter is reached. When fast = TRUE, the algorithm also stops when the non-zero members of the coefficient estimates remain the same for 10 successive iterations, when the log-likelihood difference between successive coefficient estimates is less than 0.01 times the log-likelihood increase of the first step, or when tol\sqrt k is satisfied.
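The stopping rules can be sketched as a small predicate over the iteration history. This is an illustration under several assumptions, not the package's code: should_stop() is a hypothetical helper, column 1 of beta_path is taken to be the initial value, the tol rule is read as a sum of squared differences, and the tol\sqrt k rule is left out.

```r
# Decide whether to stop, given the coefficient path (one column per
# iterate, column 1 = initial value) and the matching log-likelihoods.
should_stop <- function(beta_path, loglik_path, tol = 1e-3, max_iter = 500,
                        fast = FALSE) {
  n_cols <- ncol(beta_path)
  if (n_cols < 2) return(FALSE)               # no update performed yet
  if (n_cols - 1 >= max_iter) return(TRUE)    # iteration budget exhausted
  change <- sum((beta_path[, n_cols] - beta_path[, n_cols - 1])^2)
  if (change < tol) return(TRUE)              # coefficients have settled
  if (!fast) return(FALSE)

  # fast rule: same non-zero positions over the last 10 iterates
  if (n_cols >= 10) {
    support <- beta_path[, (n_cols - 9):n_cols, drop = FALSE] != 0
    if (all(support == support[, 1])) return(TRUE)
  }
  # fast rule: latest likelihood gain below 1% of the first-step gain
  first_gain <- loglik_path[2] - loglik_path[1]
  last_gain <- loglik_path[n_cols] - loglik_path[n_cols - 1]
  last_gain < 0.01 * first_gain
}

path <- cbind(c(0, 0, 0), c(1, 0, 2), c(1.5, 0, 2.5))
ll <- c(-100, -20, -19.9)
should_stop(path, ll)               # FALSE: coefficients still moving
should_stop(path, ll, fast = TRUE)  # TRUE: gain 0.1 is below 0.01 * 80
```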

In SMLE(), categorical features are coded by dummy covariates with the method specified in codingtype. Users can use group to specify whether to treat those dummy covariates as a single group feature or as individual features. When group = TRUE with penalize_mod = TRUE, the effect for a group of J dummy covariates is computed by

\beta_i = \sqrt{(\beta_1)^2+...+(\beta_J)^2}/\sqrt J,

which will be treated as a single feature in IHT iterations. When group = FALSE, a group of J dummy covariates will be treated as J individual features in the IHT iterations; in this case, a categorical feature is retained after screening when at least one of the corresponding dummy covariates is retained.
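As a rough illustration of the coding and grouping described above, the sketch below builds deviation-coded dummy columns for one factor with base R and computes the group effect from made-up coefficients. It assumes that contr.sum is a fair stand-in for codingtype = "DV"; group_effect() and beta_dummies are hypothetical, and the package's internal coding may differ in detail.

```r
# One categorical feature with 4 levels, coded by deviation (sum) contrasts.
f <- factor(c("a", "b", "c", "d", "a", "c"))
D <- model.matrix(~ f, contrasts.arg = list(f = "contr.sum"))[, -1]
J <- ncol(D)  # number of dummy covariates in the group

# Hypothetical coefficients for the J dummy covariates.
beta_dummies <- c(0.8, -1.2, 0.4)

# L_2 effect of the group, divided by sqrt(J) when penalize_mod = TRUE.
group_effect <- function(b, penalize_mod = TRUE) {
  l2 <- sqrt(sum(b^2))
  if (penalize_mod) l2 / sqrt(length(b)) else l2
}

group_effect(beta_dummies)                        # adjusted group effect
group_effect(beta_dummies, penalize_mod = FALSE)  # plain L_2 norm
```

Dividing by \sqrt J keeps factors with many levels from being favored in the ranking merely because their groups contain more coefficients.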

Since feature screening is usually a preprocessing step, users may wish to conduct a more elaborate feature selection after screening. This can be done by setting selection = TRUE in SMLE() or by applying any existing selection method to the output of SMLE().

Value

call

The call that produced this object.

ID_retained

A vector indicating the features retained after SMLE-screening. The output includes both features retained by SMLE() and the features specified in keyset.

coef_retained

The vector of coefficients estimated by IHT for the retained features. When the retained set contains a categorical feature, the value returns a group effect if group = TRUE, or returns the strongest dummy covariate effect if group = FALSE.

path_retained

IHT iteration path with columns recording the coefficient updates.

num_retained

Number of retained features after screening.

intercept

The estimated intercept value by IHT, if intercept = TRUE.

steps

Number of IHT iterations.

likelihood_iter

A list of log-likelihood updates over the IHT iterations.

Usearch

A vector giving the number of attempts to find a proper 1/u at each iteration step.

modified_data

A list containing data objects generated by SMLE.

CM: Design matrix of class 'matrix' for numeric features (or 'data.frame' with categorical features).

DM: A matrix with dummy variable features added (only if there are categorical features).

dum_col: Number of levels for all categorical features.

CI: Indices of categorical features in CM.

DFI: Indices of categorical features in IM.

iteration_data

A list containing data objects that track the coefficients over iterations.

IM: Iteration path matrix with columns recording IHT coefficient updates.

beta0: Initial value of the regression coefficients for IHT.

feature_name: A list containing the names of selected features.

FD: A matrix that contains feature indices retained at each iteration step.

X, Y, data, family, categorical and codingtype are returned as passed in the function call.

References

UCLA Statistical Consulting Group. Coding systems for categorical variables in regression analysis. https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/. Retrieved May 28, 2020.

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature Screening, Journal of the American Statistical Association, 109(507), 1257-1269.

Examples


# Example 1:
set.seed(1)
Data <- Gen_Data(n = 200, p = 5000, family = "gaussian", correlation = "ID")
fit <- SMLE(Y = Data$Y, X = Data$X, k = 9, family = "gaussian")
summary(fit)
Data$subset_true %in% fit$ID_retained # Sure screening check.
plot(fit)

# Example 2:
set.seed(1)
Data_sim2 <- Gen_Data(n = 420, p = 1000, family = "gaussian", num_ctgidx = 5, 
                      pos_ctgidx = c(1,3,5,7,9), effect_truecoef= c(1,2,3,-4,-5),
                      pos_truecoef = c(1,3,5,7,8), level_ctgidx = c(3,3,3,4,5))
train_X <- Data_sim2$X[1:400,]; test_X <- Data_sim2$X[401:420,]
train_Y <- Data_sim2$Y[1:400]; test_Y <- Data_sim2$Y[401:420]
fit <- SMLE(Y = train_Y, X = train_X, family = "gaussian", group = TRUE, k = 15)
predict(fit, newdata = test_X)
test_Y

# Example 3:
library(datasets)
data("attitude")
set.seed(1)
noise <- matrix(rnorm(30*100, mean = mean(attitude$rating) , sd = 1), ncol = 100)
colnames(noise) <- paste("Noise", seq(100), sep = ".")
df <- data.frame(cbind(attitude, noise))
fit <- SMLE(rating ~., data = df)
fit




[Package SMLE version 2.1-1 Index]