keyness {corpora} | R Documentation |
Compute best-practice keyness measures (corpora)
Description
Compute best-practice keyness measures (according to Evert 2022) for the frequency comparison of lexical items in two corpora. The function is fully vectorised and should be applied to a complete data set of candidate items (so statistical analysis can be adjusted to control the family-wise error rate).
Usage
keyness(f1, n1, f2, n2, measure=c("LRC", "PositiveLRC", "G2", "LogRatio", "SimpleMaths"),
        conf.level=.95, alpha=NULL, p.adjust=TRUE, lambda=1)
Arguments
f1 |
a numeric vector specifying the frequencies of candidate items in corpus A (target corpus) |
n1 |
sample size of target corpus, i.e. the total number of tokens in corpus A (usually a scalar, but can also be a vector parallel to the frequency vectors) |
f2 |
a numeric vector parallel to f1, specifying the frequencies of candidate items in corpus B (reference corpus) |
n2 |
sample size of reference corpus, i.e. the total number of tokens in corpus B (usually a scalar, but can also be a vector parallel to the frequency vectors) |
measure |
the keyness measure to be computed (see “Details” below) |
conf.level |
the desired confidence level for the confidence intervals underlying the LRC and PositiveLRC measures (see “Details” below) |
alpha |
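if specified, filter out candidate items whose frequency difference between the two corpora is not significant at level alpha, by setting their scores to 0 (see “Details” below) |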
p.adjust |
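if TRUE, apply a Bonferroni correction to control the family-wise error rate, with the family size determined from the number of candidate items in the function call; the family size can also be specified explicitly, and FALSE disables the correction (see “Details” below) |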
lambda |
parameter \lambda > 0 of the SimpleMaths measure |
Details
This function computes a range of best-practice keyness measures comparing the relative frequencies \pi_1 and \pi_2 of lexical items in populations (i.e. sublanguages) A and B, based on the observed sample frequencies f_1, f_2 and the corresponding sample sizes n_1, n_2.
The function is fully vectorised with respect to arguments f1, f2, n1 and n2, but only a single keyness measure can be selected for each function call.
All implemented measures are robust for the corner cases f_1 = 0 and f_2 = 0, but f_1 = f_2 = 0 is not allowed.
Most of the keyness measures are directional, i.e. positive scores indicate positive keyness in A (\pi_1 > \pi_2) and negative scores indicate negative keyness in A (\pi_1 < \pi_2).
By contrast, the one-sided measures PositiveLRC and SimpleMaths only detect positive keyness in A, returning small (and possibly negative) scores otherwise, i.e. in case of insufficient evidence for \pi_1 > \pi_2 and in case of strong evidence for \pi_1 < \pi_2.
One-sided measures can be useful for a ranking of the entire data set as positive keyword candidates.
Hardie (2014) and other authors recommend combining effect-size measures (in particular LogRatio) with a significance filter in order to weed out candidate items for which there is no significant evidence against the null hypothesis H_0: \pi_1 = \pi_2.
Such a filter is activated by specifying the desired significance level alpha, and can be combined with all keyness measures.
In this case, the scores of all non-significant candidate items are set to 0.
The decision is based on the likelihood-ratio test implemented by the G2 measure and its asymptotic \chi^2_1 distribution under H_0.
Note that the significance filter can also be applied to the G2 measure itself, setting all scores below the critical value for the significance test to 0.
For one-sided measures (PositiveLRC and SimpleMaths), candidates with significant evidence for negative keyness are also filtered out (i.e. their scores are set to 0) in order to ensure a consistent ranking.
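The interplay of signed G2 scores and the significance filter can be illustrated with a minimal base-R sketch. It assumes the standard likelihood-ratio statistic for the full 2x2 contingency table and leaves out the multiple-testing adjustment. The helper names g2_sketch and filter_sketch are hypothetical, and the code is not the implementation used by keyness(), whose results may differ in detail.

```r
# Signed log-likelihood score for the 2x2 table (f1, n1 - f1; f2, n2 - f2).
# Assumption: full contingency-table form of G2, not taken from the package source.
g2_sketch <- function(f1, n1, f2, n2) {
  p0 <- (f1 + f2) / (n1 + n2)                    # pooled relative frequency under H0
  obs <- cbind(f1, n1 - f1, f2, n2 - f2)         # observed cell counts
  expd <- cbind(n1 * p0, n1 * (1 - p0), n2 * p0, n2 * (1 - p0))  # expected counts
  term <- ifelse(obs > 0, obs * log(obs / expd), 0)  # cells with O = 0 contribute 0
  g2 <- 2 * rowSums(term)
  ifelse(f1 / n1 < f2 / n2, -g2, g2)             # invert sign for negative keyness
}

# Significance filter: scores whose G2 stays below the chi-squared critical
# value (1 degree of freedom) are set to 0. No Bonferroni adjustment here.
filter_sketch <- function(scores, g2, alpha) {
  crit <- qchisq(alpha, df = 1, lower.tail = FALSE)
  scores[abs(g2) < crit] <- 0
  scores
}

# three hypothetical candidates with n1 = n2 = 1000
g2 <- g2_sketch(c(7, 100, 10), 1000, c(2, 10, 100), 1000)
filtered <- filter_sketch(g2, g2, alpha = 0.05)
print(rbind(g2, filtered))
```

In this sketch the first candidate (f1=7, f2=2) stays below the critical value and is set to 0, while the other two keep their signed scores.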
By default, statistical inference corrects for multiple testing in order to control family-wise error rates.
This applies to the significance filter as well as to the confidence intervals underlying LRC and PositiveLRC.
Note that the G2 scores themselves are never adjusted (only the critical value for the significance filter is modified).
Family size m is automatically determined from the number of candidate items processed in a single function call.
Alternatively, the family size can be specified explicitly in the p.adjust argument, e.g. if a large data set is processed in multiple batches, or p.adjust=FALSE can be used to disable the correction.
For the adjustment, a highly conservative Bonferroni correction \alpha' = \alpha / m is applied to significance levels.
Since the large candidate sets and sample sizes often found in corpus linguistics tend to produce large numbers of false positives, this conservative approach is considered to be useful.
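To give a sense of how strongly the Bonferroni correction tightens the significance filter, the following base-R sketch computes the adjusted level \alpha' = \alpha / m and the corresponding critical value of the \chi^2_1 distribution. The family sizes are hypothetical values chosen for illustration only.

```r
alpha <- 0.05
m <- c(1, 100, 10000, 1e6)      # hypothetical family sizes (number of candidates)
alpha_adj <- alpha / m          # Bonferroni-adjusted significance levels

# critical G2 value for the adjusted level (upper tail of chi-squared, df = 1)
crit <- qchisq(alpha_adj, df = 1, lower.tail = FALSE)

bonferroni <- data.frame(m = m, alpha_adj = alpha_adj, crit = crit)
print(bonferroni)
```

The critical value grows only slowly with the family size, which is one reason why such a conservative correction remains workable for large candidate sets.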
See Evert (2022) and its supplementary materials for a more detailed discussion of the implemented best-practice measures and some alternatives.
Keyness Measures
G2
-
The log-likelihood measure (Rayson & Garside 2000: 3) computes the score G^2 of a likelihood-ratio test for H_0: \pi_1 = \pi_2. This test is two-sided and always returns non-negative values, so the sign of its score is inverted for f_1 / n_1 < f_2 / n_2 in order to obtain a directional keyness measure. As a pure significance measure, it tends to prefer high-frequency candidates with large f_1.
LogRatio
-
A point estimate of the log relative risk \log_2 (\pi_1 / \pi_2), which has been suggested as an intuitive keyness measure under the name LogRatio by Hardie (2014: 45). The implementation uses Walter's (1975) adjusted estimator
\log_2 \dfrac{f_1 + \frac12}{n_1 + \frac12} - \log_2 \dfrac{f_2 + \frac12}{n_2 + \frac12}
which is less biased and robust against f_i = 0. As a pure effect-size measure, LogRatio tends to assign spuriously high scores to low-frequency candidates with small f_1 and f_2 (due to sampling variation). Combination with a significance filter is strongly recommended.
LRC (default)
-
A conservative estimate for LogRatio recommended by Evert (2022) in order to combine and balance the advantages of effect-size and significance measures. A confidence interval (according to the specified conf.level) for the relative risk r = \pi_1 / \pi_2 is obtained from an exact conditional Poisson test (Fay 2010: 55), adjusted for multiple testing by default. If a candidate is not significant (i.e. the confidence interval includes H_0: r = 1), its score is set to 0. Otherwise the boundary of the confidence interval closer to 1 is taken as a conservative directional estimate of r and its \log_2 is returned.
PositiveLRC
-
A one-sided variant of LRC, which returns the lower boundary of a one-sided confidence interval for \log_2 r. Scores \leq 0 indicate that there is no significant evidence for positive keyness. The directional version of LRC is recommended for general use, but PositiveLRC may be preferred if the hermeneutic interpretation should also consider non-significant candidates (especially with small data sets).
SimpleMaths
-
The simple maths keyness measure (Kilgarriff 2009) used by the commercial corpus analysis platform Sketch Engine:
\dfrac{10^6 \cdot \frac{f_1}{n_1} + \lambda}{10^6 \cdot \frac{f_2}{n_2} + \lambda}
Its frequency bias can be adjusted with the user parameter \lambda > 0. The scaling factor 10^6 was chosen so that \lambda = 1 is a practical default value. There does not appear to be a convincing mathematical justification behind this measure. It is included here only because of the popularity of the Sketch Engine platform.
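The formulas for LogRatio and SimpleMaths can be written down directly in base R, and the idea behind LRC can be approximated in a few lines. The sketch below is illustrative only: all function names are hypothetical, and lrc_sketch is a simplified stand-in that assumes a Clopper-Pearson interval from binom.test() and applies no multiple-testing adjustment, so its values need not agree with those returned by keyness(). It relies on the fact that, under Poisson sampling and conditional on f_1 + f_2, the count f_1 is binomial with success probability p = r n_1 / (r n_1 + n_2), so that r = (p / (1 - p)) (n_2 / n_1).

```r
# Walter's (1975) adjusted LogRatio estimator, following the formula given above
log_ratio_sketch <- function(f1, n1, f2, n2) {
  log2((f1 + 0.5) / (n1 + 0.5)) - log2((f2 + 0.5) / (n2 + 0.5))
}

# Simple maths: ratio of frequencies per million words, smoothed by lambda
simple_maths_sketch <- function(f1, n1, f2, n2, lambda = 1) {
  (1e6 * f1 / n1 + lambda) / (1e6 * f2 / n2 + lambda)
}

# Simplified LRC-style score (assumption: Clopper-Pearson interval, unadjusted)
lrc_sketch <- function(f1, n1, f2, n2, conf.level = 0.95) {
  one <- function(f1, n1, f2, n2) {
    # confidence interval for the conditional binomial probability p
    ci <- binom.test(f1, f1 + f2, conf.level = conf.level)$conf.int
    r <- (ci / (1 - ci)) * (n2 / n1)      # transform to relative risk r
    if (r[1] > 1) {
      log2(r[1])                          # positive keyness: lower boundary
    } else if (r[2] < 1) {
      log2(r[2])                          # negative keyness: upper boundary
    } else {
      0                                   # interval includes r = 1
    }
  }
  mapply(one, f1, n1, f2, n2)
}

# three hypothetical candidates with n1 = n2 = 1000
f1 <- c(7, 100, 10)
f2 <- c(2, 10, 100)
scores <- data.frame(
  LogRatio    = log_ratio_sketch(f1, 1000, f2, 1000),
  SimpleMaths = simple_maths_sketch(f1, 1000, f2, 1000),
  LRC         = lrc_sketch(f1, 1000, f2, 1000)
)
print(scores)
```

For the low-frequency candidate (f1=7, f2=2) the point estimate is well above 0, whereas the interval-based score is 0 in this sketch, which illustrates why a conservative estimate can be preferable to the raw effect size.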
Value
A numeric vector of the same length as f1 and f2, containing keyness scores for all candidate lexical items.
For most measures, positive scores indicate positive keywords (i.e. higher frequency in the population underlying corpus A) and negative scores indicate negative keywords (i.e. higher frequency in the population underlying corpus B).
If alpha is specified, non-significant candidates always have a score of 0.
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
References
Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/
Fay, M. P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1), 53-58.
Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.
Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.
Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1-6, Hong Kong.
Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2), 371-374.
See Also
prop.cint, which is used by the exact conditional Poisson test of the LRC measure
Examples
library(corpora) # provides keyness() and the BNCcomparison and BNCInChargeOf data sets
# compute all keyness measures for a single candidate item with f1=7, f2=2 and n1=n2=1000
keyness(7, 1000, 2, 1000, measure="G2") # log-likelihood
keyness(7, 1000, 2, 1000, measure="LogRatio")
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05) # with significance filter
keyness(7, 1000, 2, 1000, measure="LRC") # the default measure
keyness(7, 1000, 2, 1000, measure="PositiveLRC")
keyness(7, 1000, 2, 1000, measure="SimpleMaths")
# a practical example: keywords of spoken British English (from BNC corpus)
n1 <- sum(BNCcomparison$spoken) # sample sizes
n2 <- sum(BNCcomparison$written)
kw <- transform(BNCcomparison,
  G2 = keyness(spoken, n1, written, n2, measure="G2"),
  LogRatio = keyness(spoken, n1, written, n2, measure="LogRatio"),
  LRC = keyness(spoken, n1, written, n2))
kw <- kw[order(-kw$LogRatio), ]
head(kw, 20)
# collocations of "in charge of" with LRC as an association measure
colloc <- transform(BNCInChargeOf,
  PosLRC = keyness(f.in, N.in, f.out, N.out, measure="PositiveLRC"))
colloc <- colloc[order(-colloc$PosLRC), ]
head(colloc, 30)