| predefined_tests {diyar} | R Documentation |
Predefined logical tests in diyar
Description
A collection of predefined logical tests used with sub_criteria objects
Usage
exact_match(x, y)
range_match(x, y, range = 10)
prob_link(
  x,
  y,
  cmp_func,
  attr_threshold,
  score_threshold,
  probabilistic,
  return_weights = FALSE
)
true(x, y)
false(x, y)
Arguments
| x | Attribute(s) to be compared against. |
| y | Attribute(s) to be compared by. |
| range | Difference between |
| cmp_func | Logical tests such as string comparators. See |
| attr_threshold | Matching set of weight thresholds for each result of |
| score_threshold | Score threshold determining matched or linked records. See |
| probabilistic | If |
| return_weights | If |
Details
exact_match() - test that x == y
range_match() - test that x <= y <= (x + range)
prob_link() - test that a record-pair relates to the same entity, based on the Fellegi and Sunter (1969) model.
In summary, record-pairs are created and categorised as matches and non-matches (attr_threshold) with user-defined functions (cmp_func).
If probabilistic is TRUE, two probabilities (m and u) are used to calculate weights for matches and non-matches.
The m-probability is the probability that matched records are actually from the same entity, i.e. a true match,
while the u-probability is the probability that matched records are not from the same entity, i.e. a false match.
Record-pairs whose total score is above a certain threshold (score_threshold) are assumed to belong to the same entity.
Agreement (match) and disagreement (non-match) scores are calculated as described by Asher et al. (2020).
For each record pair, an agreement score for attribute i is calculated as:
\log_{2}(m_{i}/u_{i})
For each record pair, a disagreement score for attribute i is calculated as:
\log_{2}((1-m_{i})/(1-u_{i}))
where m_{i} and u_{i} are the m and u-probabilities for each value of attribute i.
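The two formulas above can be tried out in base R. The following is only a sketch, not the prob_link() implementation: the attribute names, the m- and u-probabilities, the agreement pattern and the cut-off are all assumed values chosen for illustration.

```r
# Assumed m- and u-probabilities for three attributes (illustrative values only).
m <- c(forename = 0.95, surname = 0.90, dob = 0.99)
u <- c(forename = 0.05, surname = 0.10, dob = 0.01)

# Hypothetical comparison results for one record pair: TRUE = the values agree.
agrees <- c(forename = TRUE, surname = FALSE, dob = TRUE)

# Agreement and disagreement scores, as in the formulas above.
agreement_wt    <- log2(m / u)
disagreement_wt <- log2((1 - m) / (1 - u))

# Use the agreement score where the attribute agrees, otherwise the
# disagreement score, then sum across attributes for the pair's total score.
pair_wt     <- ifelse(agrees, agreement_wt, disagreement_wt)
total_score <- sum(pair_wt)

# Assumed cut-off; pairs scoring above it are treated as the same entity.
score_threshold <- 5
is_link <- total_score > score_threshold
```

Agreement on a discriminating attribute (large m, small u) adds a large positive score, while disagreement on it subtracts from the total, so a single mismatch does not necessarily rule out a link.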
Note that each probability is calculated as a combined probability for the record pair.
For example, if the values of the record-pair have u-probabilities of 0.1 and 0.2 respectively,
then the u-probability for the pair will be 0.02.
Missing data (NA) are considered non-matches and assigned a u-probability of 0.
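The combined u-probability in the example above is simply the product of the two values' u-probabilities. A minimal base-R check, using the figures from the text (the variable names are hypothetical and the handling of NA is not modelled here):

```r
# u-probabilities of the two values in a record pair (figures from the example).
u_values <- c(0.1, 0.2)

# Combined u-probability for the pair: the product of the individual values.
u_pair <- prod(u_values)

# Compare with a tolerance, since 0.1 * 0.2 is not stored exactly as 0.02.
matches_text <- isTRUE(all.equal(u_pair, 0.02))
```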
Examples
# `exact_match`
exact_match(x = 1, y = 1)
exact_match(x = 1, y = 2)
# `range_match`
range_match(x = 10, y = 16, range = 6)
range_match(x = 16, y = 10, range = 6)