classify_occupation {labourR} | R Documentation |
Classify occupations
Description
This function takes advantage of the hierarchical structure of the ESCO-ISCO mapping and matches multilingual free-text with the ESCO occupations vocabulary in order to map semi-structured vacancy data into the official ESCO-ISCO classification.
Usage
classify_occupation(
  corpus,
  id_col = "id",
  text_col = "text",
  lang = "en",
  num_leaves = 10,
  isco_level = 3,
  max_dist = 0.1,
  string_dist = NULL
)
Arguments
corpus |
A data.frame or a data.table that contains the id and the text variables. |
id_col |
The name of the id variable. |
text_col |
The name of the text variable. |
lang |
The language that the text is in. |
num_leaves |
The number of occupations/neighbors that are kept when matching. |
isco_level |
The ISCO level of the suggested occupations. Can be 1, 2, 3 or 4 to return ISCO groups at that level, or NULL to return ESCO occupations. |
max_dist |
String distance used for fuzzy matching. The |
string_dist |
String dissimilarity measurement. Available string distance metrics: |
Details
First, the input text is cleansed and tokenized. The tokens are then matched against the ESCO occupations vocabulary, which is built from the preferred and alternative labels of the occupations. The matched tokens are joined with the tf-idf weighted tokens of the ESCO occupations, and the sum of the tf-idf scores is used to retrieve the suggested occupations. Technically speaking, the suggested ESCO occupations are retrieved by solving the optimization problem
\arg\max_d\left\{\vec{u}_{binary}\cdot \vec{u}_d\right\}
where \vec{u}_{binary} is the binary representation of the query in the ESCO-vocabulary space and \vec{u}_d is the normalized vector of ESCO occupation d generated by the tf-idf numerical statistic.
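The scoring step can be illustrated with a minimal base R sketch. It is not the package's internal code: the three occupations, their tokens, the tokenizer and the helper `score_query()` are all hypothetical stand-ins for the real ESCO vocabulary, and the idf formula used here (log of the number of occupations divided by the document frequency) is an assumption.

```r
# Toy stand-in for the ESCO vocabulary (hypothetical occupations and tokens).
occupation_tokens <- list(
  architect      = c("architect", "building", "designer"),
  civil_engineer = c("civil", "engineer", "building"),
  cashier        = c("cashier", "shop", "assistant")
)
vocabulary <- sort(unique(unlist(occupation_tokens)))

# Term-frequency matrix: one row per occupation, one column per token.
tf <- t(vapply(
  occupation_tokens,
  function(tokens) as.numeric(vocabulary %in% tokens),
  numeric(length(vocabulary))
))
colnames(tf) <- vocabulary

# tf-idf weights (assumed idf = log(N / df)), then unit-length rows (u_d).
idf <- log(nrow(tf) / colSums(tf > 0))
tfidf <- sweep(tf, 2, idf, `*`)
tfidf_unit <- tfidf / sqrt(rowSums(tfidf^2))

# Score a query: binary vector over the vocabulary (u_binary) dotted with
# every occupation vector.
score_query <- function(query) {
  tokens <- strsplit(tolower(query), "[^a-z]+")[[1]]
  u_binary <- as.numeric(vocabulary %in% tokens)
  drop(tfidf_unit %*% u_binary)
}

scores <- score_query("Junior Architect, building design")

# Keep the two best-scoring occupations (the role played by num_leaves).
suggested <- names(sort(scores, decreasing = TRUE))[1:2]
print(suggested)
```

In this toy vocabulary "building" appears in two occupations, so it receives a lower idf weight than "architect", which is why the query ranks the architect occupation first.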
If an ISCO level is specified, the k-nearest neighbors algorithm is used to determine the suggested occupation, classified by a plurality vote in the corresponding hierarchical level of its neighbors.
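A plurality vote over the neighbors can be sketched as follows, assuming each retrieved ESCO occupation carries a 4-digit ISCO-08 code whose first k digits identify its level-k group. The codes, the helper `vote_isco_group()` and the tie-breaking rule (prefer the group of the best-ranked neighbor) are illustrative assumptions, not the package's implementation.

```r
# Illustrative 4-digit ISCO codes of the nearest ESCO occupations,
# ordered from best to worst match.
neighbor_isco <- c("2161", "2142", "2161", "2164", "3112")

vote_isco_group <- function(codes, isco_level) {
  # Truncate each code to the requested hierarchical level.
  groups <- substr(codes, 1, isco_level)
  counts <- table(groups)
  # Groups with the highest number of votes.
  winners <- names(counts)[counts == max(counts)]
  # Assumed tie-break: the group that appears earliest among the neighbors.
  winners[which.min(match(winners, groups))]
}

print(vote_isco_group(neighbor_isco, isco_level = 3))
print(vote_isco_group(neighbor_isco, isco_level = 1))
```

Lower levels pool more neighbors into the same group, so the vote tends to become more decisive as isco_level decreases.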
Before the suggestions are returned, the preferred label of each suggested occupation is added to the result, using occupations_bundle and isco_occupations_bundle as look-up tables.
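The look-up step amounts to a keyed join. The sketch below uses a small hand-made table in place of the bundled data sets; the column names `iscoGroup` and `preferredLabel` and the labels themselves are assumptions made for illustration.

```r
# Hypothetical classification result: one suggested ISCO group per id.
suggestions <- data.frame(
  id = 1:3,
  iscoGroup = c("216", "523", "263"),
  stringsAsFactors = FALSE
)

# Hypothetical look-up table standing in for the bundled ISCO labels.
lookup <- data.frame(
  iscoGroup = c("216", "263", "523"),
  preferredLabel = c(
    "Architects, planners, surveyors and designers",
    "Social and religious professionals",
    "Cashiers and ticket clerks"
  ),
  stringsAsFactors = FALSE
)

# match() keeps the original row order, unlike merge(), which sorts by key.
suggestions$preferredLabel <-
  lookup$preferredLabel[match(suggestions$iscoGroup, lookup$iscoGroup)]

print(suggestions)
```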
Value
Either a data.table with the id, the preferred label and the suggested ESCO occupation URIs (num_leaves predictions for each id), or a data.table with the id, the preferred label and the suggested ISCO group at the requested level (one for each id).
References
van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. The R Journal, 6(1), 111-122.
Gweon, H., Schonlau, M., Kaczmirek, L., Blohm, M., & Steiner, S. (2017). Three Methods for Occupation Coding Based on Statistical Learning. Journal of Official Statistics, 33(1), 101-122.
Turrell, A., Speigner, B. J., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings.
ESCO Service Platform - The ESCO Data Model documentation
Examples
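library(labourR)

# Three free-text job titles to classify
corpus <- data.frame(
  id = 1:3,
  text = c(
    "Junior Architect Engineer",
    "Cashier at McDonald's",
    "Priest at St. Martin Catholic Church"
  )
)

# Suggest one ISCO level-3 group per id, voting over the 5 nearest ESCO occupations
classify_occupation(corpus = corpus, isco_level = 3, lang = "en", num_leaves = 5)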