
model_classification {PriceIndices}

R Documentation

Building the machine learning model for product classification

Description

This function provides a trained machine learning model to classify products into classes or any other groups defined by the user. In addition, the function returns the characteristics of the model and figures describing the learning process.

Usage

model_classification(
  data_train = data.frame(),
  data_test = data.frame(),
  class = c(),
  indicators = c(),
  key_words = c(),
  sensitivity = FALSE,
  p = 0.9,
  w = 0.2,
  rounds = 200,
  grid = list()
)

Arguments

`data_train`	Training data set for the model. This set must contain all the columns defined by the `indicators` parameter and the `class` column. If the `key_words` vector is non-empty, the set should also contain a `description` column. Ideally, the indicators should be of the numerical type. If the indicator is not of the numerical type, it will be converted to this type.
`data_test`	A test set used to validate the machine learning model. It should have the same structure as the training set. This argument is optional: if the user does not specify a test set, one is drawn from the training set (see the `p` parameter).
`class`	A character string which indicates the column with classes (groups) of products (e.g. COICOPs).
`indicators`	A vector of column names to be considered in building a machine learning model. Important: the indicated variables can be numeric but also categorical (factor or character types are acceptable).
`key_words`	A vector of keywords or phrases to be recognized in the `description` column. For each keyword or phrase, a new binary variable (column) is created and included in the model training process.
`sensitivity`	A logical parameter that indicates whether keyword matching distinguishes between lowercase and uppercase letters. It is relevant only when the `key_words` vector is not empty.
`p`	A parameter used to create the test set when the user has not specified one. In that case the test set is a class-balanced subsample of the training set, and its size is 100p percent of the training set size.
`w`	A weight used in the measure for choosing the optimal machine learning model. For each combination of parameters specified in the `grid` list, the error rate of the trained model is calculated from the error on the training set (error_L = 1 - accuracy_L) and the error on the testing set (error_T = 1 - accuracy_T). The final accuracy of the model is estimated as `w * accuracy_L + (1 - w) * accuracy_T`.
`rounds`	The maximum number of iterations during the training stage.
`grid`	A list of parameter vectors that are taken into consideration during the Extreme Gradient Boosting training. The default value of this list is: `grid=list(eta=c(0.05,0.1,0.2),max_depth=c(6),min_child_weight=c(1),max_delta_step=c(0),subsample=c(1),gamma=c(0),lambda=c(1),alpha=c(0))`. The complete list of parameters for the `Tree Booster` used here is available in the online XGBoost documentation.
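The interaction of `key_words` and `sensitivity` can be illustrated with a minimal base R sketch. It is not the package's internal code: the helper `add_keyword_columns`, the toy `products` data frame, and the choice to name each new column after its keyword are assumptions made for illustration only.

```r
# Toy product descriptions (hypothetical data, for illustration only)
products <- data.frame(
  description = c("Milk UHT 3.2% 1l", "fresh milk 2%", "uht milk 0.5%"),
  stringsAsFactors = FALSE
)
key_words <- c("uht", "fresh")

# Hypothetical helper: adds one 0/1 column per keyword or phrase
add_keyword_columns <- function(df, key_words, sensitivity = FALSE) {
  for (kw in key_words) {
    if (sensitivity) {
      # Case-sensitive match; fixed = TRUE treats the keyword as a literal string
      hit <- grepl(kw, df$description, fixed = TRUE)
    } else {
      # Case-insensitive match: compare lowercased keyword and description
      hit <- grepl(tolower(kw), tolower(df$description), fixed = TRUE)
    }
    df[[kw]] <- as.integer(hit)
  }
  df
}

insensitive <- add_keyword_columns(products, key_words, sensitivity = FALSE)
sensitive <- add_keyword_columns(products, key_words, sensitivity = TRUE)
```

With `sensitivity = FALSE` the description "Milk UHT 3.2% 1l" is flagged for the keyword "uht"; with `sensitivity = TRUE` it is not, because the case differs.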

Value

This function provides a trained machine learning model to classify products into classes (or any other groups). In addition, the function returns the characteristics of the model and figures describing the learning process. The machine learning process is based on the XGBoost algorithm (from the XGBoost package), which is an implementation of gradient boosted decision trees designed for speed and performance. The function considers each combination of model parameters specified by the `grid` list and provides, inter alia, an optimally trained model, that is, a model that minimizes the error rate calculated for a fixed value of the `w` parameter. The function returns a list of the following objects:

`model`	the optimally trained model;
`best_parameters`	a set of parameters of the optimal model;
`indicators`	a vector of all indicators used;
`key_words`	a vector of all keywords and phrases used;
`classes`	a data frame with categorized classes;
`sensitivity`	the value of the `sensitivity` parameter used;
`figure_training`	a plot of the error levels calculated for the training set and the testing set during the learning process of the returned model (error = 1 - accuracy);
`figure_importance`	a plot of the relative importance of the indicators used.
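The selection of the optimal model over the parameter grid can be sketched in base R as follows. This is only an illustration under stated assumptions: no model is trained here, the accuracy values are invented placeholders rather than real results, and the actual implementation in the package may differ in its details.

```r
# A small grid: three values of eta, one value of max_depth
grid <- list(eta = c(0.05, 0.1, 0.2), max_depth = c(6))

# Every combination of the grid parameters, one row per candidate model
combos <- expand.grid(grid)

# Placeholder accuracies (invented numbers standing in for trained models)
combos$accuracy_L <- c(0.90, 0.95, 0.99)  # accuracy on the training set
combos$accuracy_T <- c(0.85, 0.88, 0.86)  # accuracy on the testing set

# Weighted accuracy as described for the w parameter
w <- 0.2
combos$score <- w * combos$accuracy_L + (1 - w) * combos$accuracy_T

# Maximizing the weighted accuracy is the same as minimizing the weighted error
best <- combos[which.max(combos$score), ]
```

In this toy setting the candidate with `eta = 0.1` is selected. The candidate with `eta = 0.2` fits the training set better, but the small weight `w = 0.2` lets the testing-set accuracy dominate the score.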

References

Tianqi Chen and Carlos Guestrin (2016). XGBoost: A Scalable Tree Boosting System. 22nd SIGKDD Conference on Knowledge Discovery and Data Mining.

Examples

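# Custom grid: three learning rates (eta) combined with two subsample ratios
my.grid <- list(eta = c(0.01, 0.02, 0.05), subsample = c(0.5, 0.8))
# Train on observations up to October 2021 and validate on November 2021
data_train <- dplyr::filter(dataCOICOP, dataCOICOP$time <= as.Date("2021-10-01"))
data_test <- dplyr::filter(dataCOICOP, dataCOICOP$time == as.Date("2021-11-01"))
ML <- model_classification(data_train, data_test, class = "coicop6", grid = my.grid,
  indicators = c("description", "codeIN", "grammage"), key_words = c("uht"), rounds = 60)
# Inspect the selected parameters, the indicators used, and the diagnostic plots
ML$best_parameters
ML$indicators
ML$figure_training
ML$figure_importance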

[Package PriceIndices version 0.1.9 Index]