model_classification {PriceIndices} | R Documentation |
Building the machine learning model for product classification
Description
This function provides a trained machine learning model to classify products into classes or any other groups defined by the user. In addition, the function returns the characteristics of the model and figures describing the learning process.
Usage
model_classification(
data_train = data.frame(),
data_test = data.frame(),
class = c(),
indicators = c(),
key_words = c(),
sensitivity = FALSE,
p = 0.9,
w = 0.2,
rounds = 200,
grid = list()
)
Arguments
data_train |
Training data set for the model. This set must contain all the columns defined by the |
data_test |
A test set that is used to validate the machine learning model. This set should have the same structure as the training set, but it is not obligatory. If the test set is not specified by the user then the test set is drawn from the training set (see |
class |
A character string which indicates the column with classes (groups) of products (e.g. COICOPs). |
indicators |
A vector of column names to be considered in building a machine learning model. Important: the indicated variables can be numeric but also categorical (factor or character types are acceptable). |
key_words |
A vector of keywords or phrases that will be recognized in the |
sensitivity |
A logical parameter that indicates whether lowercase or uppercase letters are to be distinguished when the |
p |
A parameter used to create the test set when the user has not specified one. The test set is then created from a class-balanced subsample of the training set. The size of this subsample is 100p percent of the training set size. |
w |
A parameter for determining the measure of choosing the optimal machine learning model. For each combination of parameters specified in the |
rounds |
The maximum number of iterations during the training stage. |
grid |
The list of vectors of parameters which are taken into consideration during the |
Value
In general, this function provides a trained machine learning model to classify products into classes (or any other groups). In addition, the function returns the characteristics of the model and figures describing the learning process. The machine learning process is based on the XGBoost algorithm (from the XGBoost package), which is an implementation of gradient boosted decision trees designed for speed and performance. The function evaluates each combination of model parameters (specified by the grid list) and provides, inter alia, an optimally trained model (a model that minimizes the error rate calculated on the basis of a fixed value of the w parameter). Finally, the function returns a list of the following objects:
model - the optimally trained model;
best_parameters - a set of parameters of the optimal model;
indicators - a vector of all indicators used;
key_words - a vector of all keywords and phrases used;
classes - a data frame with categorized classes;
sensitivity - the value of the 'sensitivity' parameter that was used;
figure_training - a plot of the error levels calculated for the training set and the test set during the learning process of the returned model (error = 1 - accuracy);
figure_importance - a plot of the relative importance of the indicators used.
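To make the phrase "each combination of model parameters" concrete, the following base R sketch enumerates the combinations implied by a grid list. It is an illustration only: the package may build its parameter search differently, and the names combos and params are hypothetical, not part of PriceIndices.

```r
# Illustrative sketch only (not the package's internal code):
# enumerate every combination of the values given in a grid list.
my.grid <- list(eta = c(0.01, 0.02, 0.05), subsample = c(0.5, 0.8))

# expand.grid() accepts a list and returns one row per combination
combos <- expand.grid(my.grid, KEEP.OUT.ATTRS = FALSE)

for (i in seq_len(nrow(combos))) {
  # 'params' is a hypothetical name for one candidate parameter set
  params <- as.list(combos[i, ])
  cat("candidate", i, ": eta =", params$eta,
      ", subsample =", params$subsample, "\n")
}
```

With three eta values and two subsample values, the grid above yields six candidate models, so the training time grows with the product of the vector lengths.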
References
Tianqi Chen and Carlos Guestrin (2016). XGBoost: A Scalable Tree Boosting System. 22nd SIGKDD Conference on Knowledge Discovery and Data Mining.
Examples
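# Parameter grid: every combination of eta and subsample is evaluated
my.grid <- list(eta = c(0.01, 0.02, 0.05), subsample = c(0.5, 0.8))
# Train on observations up to October 2021 and validate on November 2021
data_train <- dplyr::filter(dataCOICOP, dataCOICOP$time <= as.Date("2021-10-01"))
data_test <- dplyr::filter(dataCOICOP, dataCOICOP$time == as.Date("2021-11-01"))
ML <- model_classification(data_train, data_test, class = "coicop6", grid = my.grid,
  indicators = c("description", "codeIN", "grammage"), key_words = c("uht"), rounds = 60)
# Inspect the selected parameters, the indicators used, and the diagnostic plots
ML$best_parameters
ML$indicators
ML$figure_training
ML$figure_importance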
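
The p argument describes a class-balanced subsample of the training set. One possible reading of that description is a stratified draw, where the same fraction p is taken from every class. The base R sketch below illustrates this reading on a toy data frame. It is an assumption, not the package's actual sampling code, and the names toy, idx_by_class, picked and subsample are hypothetical.

```r
# Illustrative sketch only: a stratified (per-class) subsample of size 100p percent.
# This is an assumed interpretation of "class-balanced", not the package's code.
set.seed(1)
toy <- data.frame(
  class = rep(c("A", "B", "C"), times = c(50, 30, 20)),
  value = seq_len(100)
)
p <- 0.9

# Row indices grouped by class
idx_by_class <- split(seq_len(nrow(toy)), toy$class)

# Draw round(p * n) rows from each class without replacement
picked <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}), use.names = FALSE)

subsample <- toy[picked, ]
table(subsample$class)
```

Sampling within each class keeps the class proportions of the subsample close to those of the full training set, which matters when some product groups are rare.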