train_similarity_based_reasoning {occupationMeasurement} | R Documentation |
Train Similarity Based Probability Model with anonymized training data
Description
This function requires the mvtnorm package.
Usage
train_similarity_based_reasoning(
  anonymized_data,
  num_allowed_codes = 1291,
  coding_index_w_codes,
  coding_index_without_codes = NULL,
  preprocessing = list(stopwords = NULL, stemming = NULL, strPreprocessing = TRUE,
    removePunct = FALSE),
  dist_type = c("wordwise", "substring", "fulltext"),
  dist_control = list(method = "osa", weight = c(d = 1, i = 1, s = 1, t = 1)),
  threshold = c(max = 3, use = 1),
  simulation_control = list(n.draws = 250, check_normality = FALSE)
)
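The default for dist_type is written as a vector of all three options, which is the usual R idiom for an argument resolved with match.arg(): if the caller supplies nothing, the first element ("wordwise") is used. The sketch below only illustrates that idiom. The helper resolve_dist_type is hypothetical, and whether train_similarity_based_reasoning resolves the argument this way internally is an assumption.

```r
# Hypothetical helper, not part of occupationMeasurement: it only mimics
# how a vector-valued default is commonly resolved with match.arg().
resolve_dist_type <- function(dist_type = c("wordwise", "substring", "fulltext")) {
  match.arg(dist_type)
}

# No argument supplied: match.arg() falls back to the first choice.
default_choice <- resolve_dist_type()

# An explicit choice is validated against the allowed options.
explicit_choice <- resolve_dist_type("fulltext")

print(default_choice)
print(explicit_choice)
```

Under this idiom, a value outside the three options raises an error rather than silently falling back to a default.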
Arguments
anonymized_data |
|
num_allowed_codes |
the number of allowed codes in the target classification. There are 1286 categories in the KldB 2010 plus 5 special codes in both anonymized training data sets, so the default value is 1291. |
coding_index_w_codes |
a data.table with columns
|
coding_index_without_codes |
(not used, but automatically determined) Any words from |
preprocessing |
a list with elements
|
dist_type |
How to calculate similarity between entries from both coding_indices and verbal answers from the survey? Three options are currently supported. Since we use the
|
dist_control |
If |
threshold |
A numeric vector with two elements. If |
simulation_control |
a list with two components,
|
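To make the defaults of dist_control and threshold more concrete, here is a minimal base-R sketch of an optimal string alignment ("osa") distance with the four operation weights d, i, s and t, followed by a threshold filter. The function osa_distance, the toy strings, and the reading of threshold (entries with a distance of at most use count as similar) are assumptions made for illustration; the package may compute and apply distances differently.

```r
# Illustrative OSA distance; NOT the package's implementation.
# weight: d = deletion, i = insertion, s = substitution, t = transposition.
osa_distance <- function(a, b, weight = c(d = 1, i = 1, s = 1, t = 1)) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  D <- matrix(0, n + 1, m + 1)
  D[, 1] <- (0:n) * weight[["d"]]  # delete every character of the prefix of a
  D[1, ] <- (0:m) * weight[["i"]]  # insert every character of the prefix of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0 else weight[["s"]]
      D[i + 1, j + 1] <- min(
        D[i, j + 1] + weight[["d"]],
        D[i + 1, j] + weight[["i"]],
        D[i, j] + cost
      )
      # Swap of two adjacent characters counts as one transposition.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        D[i + 1, j + 1] <- min(D[i + 1, j + 1], D[i - 1, j - 1] + weight[["t"]])
      }
    }
  }
  D[n + 1, m + 1]
}

# Toy data, invented for this example.
coding_index_titles <- c("BAECKER", "BAECKERIN", "LEHRER")
answer <- "BAEKCER"  # verbal answer containing a typo
threshold <- c(max = 3, use = 1)

distances <- vapply(coding_index_titles, osa_distance, numeric(1), b = answer)

# Assumed interpretation: keep entries whose distance does not exceed "use".
similar_titles <- names(distances)[distances <= threshold[["use"]]]
print(distances)
print(similar_titles)
```

Raising an individual weight, for example s = 2, makes the corresponding edit operation more expensive and therefore shrinks the set of entries that fall under the threshold.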
Value
a list with components
- prediction.datasets$modelProb
Contains all entries from the coding index. dist = "official" if the entry stems from coding_index_w_codes and dist = "selfcreated" if the entry stems from coding_index_without_codes. string.prob is used for weighting purposes (model averaging) if a new verbal answer is similar to multiple strings. unobserved.mean.theta gives a probability (usually very low) for any category that was not observed in the training data together with this string.
- prediction.datasets$categoryProb
mean.theta is the probability for code given that an incoming verbal answer is similar to string. Only available if this code was observed at least once with this string (use unobserved.mean.theta otherwise).
- num_allowed_codes
Number of categories in the classification.
- preprocessing
The input parameter stored to replicate preprocessing with incoming data.
- dist_type
The input parameter stored to replicate distance calculations with incoming data.
- dist_control
The input parameter stored to replicate distance calculations with incoming data.
- threshold
The input parameter stored to replicate distance calculations with incoming data.
- simulation_control
The input parameters controlling the Monte Carlo simulation.
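The following sketch shows one way the columns described above could be combined when a new verbal answer is similar to several strings: string.prob serves as a normalised weight, and mean.theta is used where the code was observed with a string, with unobserved.mean.theta as the fallback. The toy tables, the placeholder codes, the helper predict_code_probability and the averaging rule itself are assumptions for illustration, not the package's prediction routine.

```r
# Toy stand-ins for the two prediction data sets; all values are invented.
model_prob <- data.frame(
  string = c("BAECKER", "BAECKERGESELLE"),
  string.prob = c(0.6, 0.2),
  unobserved.mean.theta = c(0.0005, 0.001),
  stringsAsFactors = FALSE
)
category_prob <- data.frame(
  string = c("BAECKER", "BAECKER", "BAECKERGESELLE"),
  code = c("code_A", "code_B", "code_A"),  # placeholder codes
  mean.theta = c(0.90, 0.05, 0.95),
  stringsAsFactors = FALSE
)

# Hypothetical helper: weighted average of per-string code probabilities.
predict_code_probability <- function(similar_strings, code) {
  rows <- model_prob[model_prob$string %in% similar_strings, ]
  weights <- rows$string.prob / sum(rows$string.prob)  # normalise to sum to 1
  theta <- vapply(seq_len(nrow(rows)), function(k) {
    hit <- category_prob$string == rows$string[k] & category_prob$code == code
    # Use mean.theta if the code was observed with this string,
    # otherwise fall back to unobserved.mean.theta.
    if (any(hit)) category_prob$mean.theta[hit][1] else rows$unobserved.mean.theta[k]
  }, numeric(1))
  sum(weights * theta)
}

similar <- c("BAECKER", "BAECKERGESELLE")
p_a <- predict_code_probability(similar, "code_A")
p_b <- predict_code_probability(similar, "code_B")
print(c(code_A = p_a, code_B = p_b))
```

In this toy setup code_B was never observed with "BAECKERGESELLE", so that string contributes only its small unobserved.mean.theta to the average.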
References
Schierholz, Malte (2019): New methods for job and occupation classification. Dissertation, Mannheim. https://madoc.bib.uni-mannheim.de/50617/, pp. 206-208, p. 268, and pp. 308-320
https://github.com/malsch/occupationCoding (function trainSimilarityBasedReasoning2 is implemented here)
See Also
pretrained_models, which were created using this function.