data_matching {PriceIndices} | R Documentation |
Matching products
Description
This function returns a data set defined in the first parameter (data) with an additional column (prodID). Two products are treated as being matched if they have the same prodID value.
Usage
data_matching(
data,
start,
end,
interval = FALSE,
variables = c(),
codeIN = TRUE,
codeOUT = TRUE,
description = TRUE,
onlydescription = FALSE,
precision = 0.95
)
Arguments
data |
The user's data frame with information about products to be matched. It must contain columns: |
start |
The base period (as character) limited to the year and month, e.g. "2020-03". |
end |
The research period (as character) limited to the year and month, e.g. "2020-04". |
interval |
A logical value indicating whether the matching process concerns only two periods defined by |
variables |
The optional parameter describing the vector of additional column names. Values of these additional columns must be identical for matched products. |
codeIN |
A logical value, e.g. if there are retailer (internal) product codes (as numeric or character) written in |
codeOUT |
A logical value, e.g. if there are external product codes, such as GTIN or SKU (as numeric or character) written in |
description |
A logical value, e.g. if there are product labels (as character) written in |
onlydescription |
A logical value indicating whether products with identical labels (described in the |
precision |
A threshold value for the Jaro-Winkler similarity measure used when comparing labels (its value must belong to the interval [0,1]). Two labels are treated as similar enough if their Jaro-Winkler similarity exceeds the precision value. |
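To make the role of the precision threshold concrete, the following is a minimal base-R sketch of the Jaro-Winkler similarity. It is an illustration only and not the package's internal code: the helper name jaro_winkler and the prefix weight p = 0.1 are assumptions, and the package may rely on a different implementation.

```r
# Illustrative Jaro-Winkler similarity in base R (hypothetical helper,
# not part of PriceIndices). p = 0.1 is the conventional prefix weight,
# assumed here.
jaro_winkler <- function(a, b, p = 0.1) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(1)
  if (n1 == 0 || n2 == 0) return(0)
  # Characters match only if they lie within this distance of each other
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  matched1 <- logical(n1)
  matched2 <- logical(n2)
  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }
  m <- sum(matched1)
  if (m == 0) return(0)
  # Transpositions: matched characters that appear in a different order
  t <- sum(s1[matched1] != s2[matched2]) / 2
  jaro <- (m / n1 + m / n2 + (m - t) / m) / 3
  # Winkler boost for a common prefix of up to four characters
  prefix <- 0
  for (i in seq_len(min(4, n1, n2))) {
    if (s1[i] == s2[i]) prefix <- prefix + 1 else break
  }
  jaro + prefix * p * (1 - jaro)
}

sim <- jaro_winkler("MARTHA", "MARHTA")
sim > 0.95  # TRUE: similar enough at the default precision
sim > 0.98  # FALSE: rejected at a stricter threshold
```

Raising precision therefore makes label matching more conservative: the same pair of labels can pass at 0.95 and fail at 0.98.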
Value
This function returns a data set defined in the first parameter (data) with an additional column (prodID). Two products are treated as being matched if they have the same prodID value. The procedure for generating this additional column depends on the set of columns chosen for matching. In the most extreme case, when the onlydescription parameter is TRUE, two products are also matched if they have identical descriptions. The other cases are as follows:
Case 1: Parameters codeIN, codeOUT and description are set to TRUE. Products with two identical codes, or with one identical code and an identical description, are automatically matched. Products are also matched if one of their codes is identical and the Jaro-Winkler similarity of their descriptions is greater than the precision value.
Case 2: Only one of the parameters codeIN or codeOUT is set to TRUE, and the description parameter is also set to TRUE. Products with an identical chosen code and an identical description are automatically matched. In the second stage, products are also matched if they have an identical chosen code and the Jaro-Winkler similarity of their descriptions is greater than the precision value.
Case 3: Parameters codeIN and codeOUT are set to TRUE and the description parameter is set to FALSE. In this case, products are matched if both of their codes are identical.
Case 4: Only the description parameter is set to TRUE. This case requires the onlydescription parameter to be TRUE; the matching process is then based only on product labels (two products are matched if they have identical descriptions).
Case 5: Only one of the parameters codeIN or codeOUT is set to TRUE and the description parameter is set to FALSE. In this case, the only reasonable option is to return a prodID column that is identical to the chosen code column.
Please note that if the set of column names defined in the variables parameter is not empty, then the values of these additional columns must also be identical for products to be matched.
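The way the logical parameters select one of the five cases can be summarised in a short sketch. The helper matching_case below is hypothetical and not part of the package; in particular, signalling an error when Case 4 is requested without onlydescription = TRUE is an assumption, since the text above only states the requirement and not how the function reacts.

```r
# Hypothetical helper: maps the logical parameters to the case labels
# used in the Value section. Not part of PriceIndices.
matching_case <- function(codeIN = TRUE, codeOUT = TRUE,
                          description = TRUE, onlydescription = FALSE) {
  n_codes <- sum(codeIN, codeOUT)  # number of code columns in use
  if (n_codes == 2 && description) return("Case 1")
  if (n_codes == 1 && description) return("Case 2")
  if (n_codes == 2 && !description) return("Case 3")
  if (n_codes == 0 && description) {
    # Assumed behaviour: the documentation only says this is required
    if (!onlydescription) stop("Case 4 requires onlydescription = TRUE")
    return("Case 4")
  }
  if (n_codes == 1 && !description) return("Case 5")
  stop("no columns selected for matching")
}

matching_case()                                      # "Case 1" (the defaults)
matching_case(codeOUT = FALSE)                       # "Case 2"
matching_case(description = FALSE)                   # "Case 3"
matching_case(codeOUT = FALSE, description = FALSE)  # "Case 5"
```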
Examples
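# Products with identical descriptions are also matched
data_matching(dataMATCH, start = "2018-12", end = "2019-02", onlydescription = TRUE, interval = TRUE)
# Stricter Jaro-Winkler threshold for comparing descriptions
data_matching(dataMATCH, start = "2018-12", end = "2019-02", precision = 0.98, interval = TRUE)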
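Because matched products share a prodID value, the returned data set can be summarised by that column. The sketch below uses a small made-up data frame as a stand-in for the function's output; the time and description column names and all values are assumptions chosen for illustration.

```r
# Toy stand-in for the output of data_matching(); all values are made up.
matched <- data.frame(
  time = as.Date(c("2018-12-01", "2019-01-01", "2019-02-01",
                   "2018-12-01", "2019-02-01")),
  description = c("milk 1l", "milk 1 l", "milk 1l",
                  "butter 200g", "yoghurt 150g"),
  prodID = c(1, 1, 1, 2, 3)
)

# Number of distinct months in which each prodID is observed
months_seen <- tapply(format(matched$time, "%Y-%m"), matched$prodID,
                      function(m) length(unique(m)))

# prodID values observed in more than one month, i.e. matched over time
names(months_seen)[months_seen > 1]
```

In this toy data only prodID 1 appears in more than one month, so it is the only product that can be compared between the base and research periods.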