select_features {MetaNLP}    R Documentation

Select features via elasticnet regularization

Description

The word count matrix grows quickly with the number of abstracts and can easily reach several thousand columns. It can therefore be important to extract the columns that carry most of the information for the decision-making process. This function uses a generalized linear model combined with elasticnet regularization to extract these features. In contrast to an unpenalized regression model or an L2 penalty (ridge regression), elasticnet (and LASSO) sets some regression parameters exactly to 0. The selected features are therefore exactly those with a non-zero coefficient.

Usage

select_features(object, ...)

## S4 method for signature 'MetaNLP'
select_features(object, alpha = 0.8, lambda = "avg", seed = NULL, ...)

Arguments

object

An object of class MetaNLP

...

Additional arguments for cv.glmnet. An important option might be type.measure to specify which loss is used when the cross validation is executed.

alpha

The elastic net mixing parameter, with 0 \leq \alpha \leq 1. alpha = 1 gives the lasso penalty, alpha = 0 gives the ridge penalty.

lambda

The weight parameter of the penalty. Possible values are "avg", "min", "1se", or a numeric value that sets \lambda directly. With "avg", "min" or "1se", cross validation is used to determine \lambda. Note that cross validation uses random folds, so the results are not necessarily replicable. "avg" calls select_features 10 times, computes for each iteration the \lambda that minimizes the loss, and then uses the median of these values as the final value of \lambda. "min" and "1se" carry out the cross validation only once; \lambda is then either the value for which the cross-validated error is minimized (option "min") or the value that gives the most regularized model such that the cross-validated error is within one standard error of the minimum (option "1se").

seed

A numeric value which is used as a local seed for this function. Default is seed = NULL, so no seed is set. Setting a seed leads to replicable results of the cross validation, such that each call of select_features selects the same columns. If a seed is set, the option lambda = "avg" yields the same results as lambda = "min".
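The three cross-validated choices of lambda described above can be illustrated directly with glmnet. This is a minimal sketch on simulated data, not the package's internal code: the simulated matrix x, the response y, the binomial family and the helper names lambda_min, lambda_1se, lambda_runs and lambda_avg are assumptions made for illustration.

```r
library(glmnet)

# Simulated stand-in for a word count matrix (assumption, not package data):
# 200 documents, 30 "words", only the first 3 influence the binary decision.
set.seed(1)
n <- 200
p <- 30
x <- matrix(rpois(n * p, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("word", seq_len(p))))
eta <- 0.8 * x[, 1] - 0.8 * x[, 2] + 0.6 * x[, 3] - 1.2
y <- rbinom(n, size = 1, prob = plogis(eta))

# A single cross-validation run yields the "min" and "1se" candidates.
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0.8)
lambda_min <- cv_fit$lambda.min   # minimizes the cross-validated error
lambda_1se <- cv_fit$lambda.1se   # most regularized model within one SE

# "avg"-style choice: repeat the cross validation 10 times with fresh
# random folds and take the median of the minimizing lambdas.
lambda_runs <- replicate(
  10,
  cv.glmnet(x, y, family = "binomial", alpha = 0.8)$lambda.min
)
lambda_avg <- median(lambda_runs)

c(min = lambda_min, `1se` = lambda_1se, avg = lambda_avg)
```

Because the folds are random, lambda_runs typically varies between runs; taking the median is one way to stabilize the choice.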

Details

The computations are carried out by the glmnet package. First, a model is fitted via glmnet. The elastic net parameter \alpha can be specified by the user. The parameter \lambda, which determines the weight of the penalty, can either be chosen via cross validation (using cv.glmnet) or be given directly as a numeric value.
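The selection step can be sketched as follows: fit a penalized model with glmnet for a given alpha and lambda, then keep the columns whose coefficient is non-zero. This is an illustrative sketch under assumptions, not the package's implementation: the simulated data, the binomial family, the value lambda = 0.05 and the names beta, selected and x_selected are all hypothetical.

```r
library(glmnet)

# Simulated stand-in for a word count matrix (assumption, not package data).
set.seed(2)
n <- 200
p <- 30
x <- matrix(rpois(n * p, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("word", seq_len(p))))
eta <- 0.8 * x[, 1] - 0.8 * x[, 2] + 0.6 * x[, 3] - 1.2
y <- rbinom(n, size = 1, prob = plogis(eta))

# Fit an elastic net model for one fixed lambda (assumed value).
fit <- glmnet(x, y, family = "binomial", alpha = 0.8, lambda = 0.05)

# Extract the coefficients and drop the intercept (first row).
beta <- as.matrix(coef(fit))[-1, 1]

# The selected features are the columns with a non-zero coefficient.
selected <- names(beta)[beta != 0]
x_selected <- x[, selected, drop = FALSE]

dim(x_selected)
```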

Value

An object of class MetaNLP, where the columns were selected via elastic net.

Note

With a fixed value for lambda, the number of selected features can easily be adjusted via the parameter alpha. The smaller alpha is chosen, the more columns remain in the resulting data frame; the larger alpha is chosen, the fewer columns are selected.
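The effect of alpha at a fixed lambda can be checked with a small experiment. The sketch below uses glmnet directly on simulated data; the data, the binomial family, the value lambda = 0.05 and the name n_selected are assumptions made for illustration.

```r
library(glmnet)

# Simulated stand-in for a word count matrix (assumption, not package data).
set.seed(3)
n <- 200
p <- 30
x <- matrix(rpois(n * p, lambda = 2), nrow = n,
            dimnames = list(NULL, paste0("word", seq_len(p))))
eta <- 0.8 * x[, 1] - 0.8 * x[, 2] + 0.6 * x[, 3] - 1.2
y <- rbinom(n, size = 1, prob = plogis(eta))

# Count the non-zero coefficients for several alpha values at one lambda.
alphas <- c(0.1, 0.5, 1)
n_selected <- sapply(alphas, function(a) {
  fit <- glmnet(x, y, family = "binomial", alpha = a, lambda = 0.05)
  sum(as.matrix(coef(fit))[-1, 1] != 0)  # intercept excluded
})

data.frame(alpha = alphas, n_selected = n_selected)
```

On data like this, the count for the smallest alpha is expected to be at least as large as the count for alpha = 1, since a larger alpha puts more weight on the sparsity-inducing L1 part of the penalty.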

Examples

path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
obj2 <- select_features(obj, alpha = 0.7, lambda = "min")
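A further example, using only the arguments documented above, shows how the seed argument might be used to make the cross-validated selection replicable. The seed value 42 and the object names obj_a and obj_b are arbitrary choices for illustration.

```r
library(MetaNLP)

path <- system.file("extdata", "test_data.csv", package = "MetaNLP",
                    mustWork = TRUE)
obj <- MetaNLP(path)

# Two calls with the same local seed should use the same random folds
# and therefore select the same columns.
obj_a <- select_features(obj, alpha = 0.7, lambda = "min", seed = 42)
obj_b <- select_features(obj, alpha = 0.7, lambda = "min", seed = 42)

# Compare the two results.
identical(obj_a, obj_b)
```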



[Package MetaNLP version 0.1.2 Index]