feature_selector {creditmodel}    R Documentation

Feature Selection Wrapper

Description

This function applies up to four methods (IV, PSI, correlation, XGBoost) to select important features. The correlation filter must be used together with IV.
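Why the correlation filter would need IV can be illustrated with a minimal base R sketch. It assumes (this is not confirmed by the package documentation) that when two features are highly correlated, the one with the lower IV is dropped. The helper cor_iv_filter_sketch and the IV values are hypothetical and not part of creditmodel.

```r
# Hypothetical helper, not part of creditmodel: among features whose absolute
# pairwise correlation exceeds cor_cp, keep only the one with the highest IV.
cor_iv_filter_sketch <- function(dat, iv, cor_cp = 0.98) {
  # iv: named numeric vector of IV scores, one per column of dat
  vars <- names(sort(iv, decreasing = TRUE))        # strongest features first
  cor_mat <- abs(stats::cor(dat[, vars, drop = FALSE]))
  keep <- character(0)
  for (v in vars) {
    # keep v only if it is not too correlated with any feature already kept
    if (length(keep) == 0 || all(cor_mat[v, keep] <= cor_cp)) {
      keep <- c(keep, v)
    }
  }
  keep
}

set.seed(46)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.01),  # near-duplicate of x1
  x3 = rnorm(200)                   # independent feature
)
iv_scores <- c(x1 = 0.30, x2 = 0.10, x3 = 0.05)     # made-up IV values
selected <- cor_iv_filter_sketch(toy, iv_scores, cor_cp = 0.98)
print(selected)
```

In this toy data, x2 is almost a copy of x1 and has the lower IV, so it is the one removed. Without an IV ranking there would be no principled way to choose between the two.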

Usage

feature_selector(
  dat_train,
  dat_test = NULL,
  x_list = NULL,
  target = NULL,
  pos_flag = NULL,
  occur_time = NULL,
  ex_cols = NULL,
  filter = c("IV", "PSI", "XGB", "COR"),
  cv_folds = 1,
  iv_cp = 0.01,
  psi_cp = 0.5,
  xgb_cp = 0,
  cor_cp = 0.98,
  breaks_list = NULL,
  hopper = FALSE,
  vars_name = TRUE,
  parallel = FALSE,
  note = TRUE,
  seed = 46,
  save_data = FALSE,
  file_name = NULL,
  dir_path = tempdir(),
  ...
)

Arguments

dat_train

A data.frame with independent variables and target variable.

dat_test

A data.frame of test data. Default is NULL.

x_list

Names of independent variables.

target

The name of target variable.

pos_flag

The value of the positive class of the target variable. The signature default is NULL; the documented default positive class is "1".

occur_time

The name of the variable that represents the time at which each observation takes place.

ex_cols

A list of excluded variables. Regular expressions can also be used to match variable names. Default is NULL.

filter

The methods for selecting important and stable variables.

cv_folds

Number of cross-validation folds. Default is 1.

iv_cp

The minimum IV threshold. 0 < iv_i; values from 0.01 to 0.1 usually work. Default is 0.01.

psi_cp

The maximum PSI threshold. 0 <= psi_i <= 1; values from 0.05 to 0.2 usually work. Default is 0.5.

xgb_cp

Threshold on the XGBoost Gain of a feature. 0 <= xgb_cp <= 1. Default is 0.

cor_cp

Threshold of correlation between features. 0 <= cor_cp <= 1; values from 0.7 to 0.98 usually work. Default is 0.98.

breaks_list

A table containing a list of splitting points for each independent variable. Default is NULL.

hopper

Logical. Filter screening. Default is FALSE.

vars_name

Logical. Whether to output a list of the filtered variables or a table with the detailed IV and PSI values of each variable. Default is TRUE.

parallel

Logical, parallel computing. Default is FALSE.

note

Logical. Whether to output progress information. Default is TRUE.

seed

Random number seed. Default is 46.

save_data

Logical, save results in locally specified folder. Default is FALSE.

file_name

The name for periodically saved results files. Default is NULL.

dir_path

The path for periodically saved results files. Default is tempdir().

...

Other parameters.
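To make the iv_cp and psi_cp thresholds concrete, here is a minimal base R sketch using the textbook definitions of IV and PSI on already-binned values. The helpers iv_sketch and psi_sketch are hypothetical. The binning, smoothing and inclusive comparisons are assumptions and may differ from what creditmodel does internally.

```r
# Hypothetical helper: Information Value of a binned feature against a
# binary target, IV = sum((p_pos - p_neg) * log(p_pos / p_neg)) over bins.
iv_sketch <- function(bins, y, pos_flag = "1", eps = 1e-6) {
  bins <- factor(bins)
  is_pos <- as.character(y) == pos_flag
  p_pos <- as.numeric(table(bins[is_pos])) / sum(is_pos)
  p_neg <- as.numeric(table(bins[!is_pos])) / sum(!is_pos)
  # eps guards against log(0) and division by zero for empty bins
  p_pos <- pmax(p_pos, eps)
  p_neg <- pmax(p_neg, eps)
  sum((p_pos - p_neg) * log(p_pos / p_neg))
}

# Hypothetical helper: Population Stability Index between two samples,
# PSI = sum((p_actual - p_expected) * log(p_actual / p_expected)) over bins.
psi_sketch <- function(bins_expected, bins_actual, eps = 1e-6) {
  lev <- union(as.character(bins_expected), as.character(bins_actual))
  p_e <- as.numeric(table(factor(bins_expected, levels = lev))) /
    length(bins_expected)
  p_a <- as.numeric(table(factor(bins_actual, levels = lev))) /
    length(bins_actual)
  p_e <- pmax(p_e, eps)
  p_a <- pmax(p_a, eps)
  sum((p_a - p_e) * log(p_a / p_e))
}

# Toy data: one binned feature in an earlier and a later sample
bins_train <- c("a", "a", "a", "b", "b", "b", "b", "b")
y_train    <- c(1, 1, 0, 1, 0, 0, 0, 0)
bins_later <- c("a", "a", "a", "a", "b", "b", "b", "b")

iv_value    <- iv_sketch(bins_train, y_train, pos_flag = "1")
psi_same    <- psi_sketch(bins_train, bins_train)   # identical samples
psi_shifted <- psi_sketch(bins_train, bins_later)   # shifted distribution

# Assumed rule: keep a feature if IV >= iv_cp and PSI <= psi_cp
iv_cp  <- 0.01
psi_cp <- 0.5
keep_feature <- iv_value >= iv_cp && psi_shifted <= psi_cp
print(c(iv = iv_value, psi = psi_shifted))
print(keep_feature)
```

A higher iv_cp removes more weakly predictive features, while a lower psi_cp removes more features whose distribution drifts between samples.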

Value

A list of selected features

See Also

psi_iv_filter, xgb_filter, gbm_filter

Examples

library(creditmodel)

# Select features by IV and PSI on the first 1000 rows of UCICreditCard
feature_selector(dat_train = UCICreditCard[1:1000, c(2, 8:12, 26)],
                 dat_test = NULL, target = "default.payment.next.month",
                 occur_time = "apply_date", filter = c("IV", "PSI"),
                 cv_folds = 1, iv_cp = 0.01, psi_cp = 0.1, xgb_cp = 0, cor_cp = 0.98,
                 vars_name = FALSE, note = FALSE)

[Package creditmodel version 1.3.0 Index]