keyness_scores {keyperm}    R Documentation

Calculate observed keyness scores

Description

Calculates a vector of observed keyness scores for a given pair of corpora.

Usage

keyness_scores(ifl, type = "llr", laplace = 1)

Arguments

ifl

Indexed frequency list as generated by create_ifl().

type

The type of keyness measure. One of "llr", "chisq", "diff", "logratio" or "ratio". See Details.

laplace

Parameter k of the Laplace correction. Only relevant for type = "ratio" and type = "logratio". See Details.

Details

Keyness scores are calculated from an indexed frequency list that create_ifl() has generated for a given pair of corpora.

Currently, the following types of scores are supported:

llr

The log-likelihood ratio

chisq

The chi-square statistic

diff

Difference of relative frequencies

logratio

Binary logarithm of the ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.

ratio

Ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.

llr and chisq are the test statistics for a two-by-two contingency table:

                   corpus A   corpus B   TOTAL
term of interest   o11        o12        r1
other tokens       o21        o22        r2
TOTAL              c1         c2         N

Both measure the deviation from equal proportions but do not indicate its direction. For llr, the full version with terms for all four cells of the table is used, not the two-term version that is sometimes used in corpus linguistics:

llr = 2 * (o11 * log(o11/e11) + o12 * log(o12/e12) + o21 * log(o21/e21) + o22 * log(o22/e22))

where oij * log(oij/eij) = 0 if oij = 0.
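
As an illustration only, the statistic can be sketched in a few lines of R. This is not the package's implementation: llr_2x2 is a hypothetical helper, it uses the standard non-negative form (twice the sum of o * log(o/e)), and it assumes the expected counts are row total times column total divided by N.

```r
# Hypothetical helper, not part of keyperm: llr for one term from raw counts.
# o11, o12: occurrences of the term in corpora A and B
# c1, c2:   total number of tokens in corpora A and B
llr_2x2 <- function(o11, o12, c1, c2) {
  # matrix() fills column-wise: column 1 is corpus A, column 2 is corpus B
  obs <- matrix(c(o11, c1 - o11, o12, c2 - o12), nrow = 2)
  # expected counts under independence: row total * column total / N
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  # o * log(o / e), taken to be 0 where o == 0
  terms <- ifelse(obs > 0, obs * log(obs / expd), 0)
  2 * sum(terms)
}

llr_2x2(30, 10, 1000, 1000)  # term more frequent in A
llr_2x2(10, 30, 1000, 1000)  # same value: the direction is not indicated
llr_2x2(5, 0, 1000, 1000)    # finite although o12 = 0
```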

chisq is the usual chi-square statistic for a test of independence / homogeneity:

chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 + (o21 - e21)^2/e21 + (o22 - e22)^2/e22

Here, oij are the observed counts as given above and eij are the corresponding expected values under an independence / homogeneity assumption, i.e. eij = ri * cj / N.
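
A minimal sketch of this computation in R, for illustration only: chisq_2x2 is a hypothetical helper and not part of the package. The comparison assumes that base R's chisq.test() with correct = FALSE (no continuity correction) computes the same Pearson statistic for a two-by-two table.

```r
# Hypothetical helper, not part of keyperm: chi-square for one term.
chisq_2x2 <- function(o11, o12, c1, c2) {
  # column 1 is corpus A, column 2 is corpus B
  obs <- matrix(c(o11, c1 - o11, o12, c2 - o12), nrow = 2)
  # expected counts: row total * column total / N
  expd <- outer(rowSums(obs), colSums(obs)) / sum(obs)
  sum((obs - expd)^2 / expd)
}

by_hand <- chisq_2x2(30, 10, 1000, 1000)
by_test <- unname(chisq.test(matrix(c(30, 970, 10, 990), nrow = 2),
                             correct = FALSE)$statistic)
all.equal(by_hand, by_test)
```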

diff and logratio are measures of effect size, but with the permutation approach implemented here a p-value can be calculated for them as well. Both indicate the direction of the effect and can be used for one- or two-sided tests.

diff = o11 / c1 - o12 / c2
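
The sign of diff carries the direction of the effect, which a short sketch can show. rel_freq_diff is a hypothetical helper used only for illustration, not a function of the package.

```r
# Hypothetical helper, not part of keyperm: difference of relative frequencies.
rel_freq_diff <- function(o11, o12, c1, c2) o11 / c1 - o12 / c2

d_a <- rel_freq_diff(30, 10, 1000, 2000)  # positive: relatively more frequent in A
d_b <- rel_freq_diff(10, 30, 1000, 2000)  # negative: relatively more frequent in B
c(d_a, d_b)
```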

logratio is based on a ratio of ratios and would be infinite when a term does not occur in one of the two corpora, irrespective of the number of occurrences in the other corpus. Hence, we use a Laplace correction that adds a (not necessarily integer) number k of fictitious occurrences to both corpora:

logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) )

where o11 and o12 are the numbers of occurrences of the term of interest in corpora A and B, and c1 and c2 are the total numbers of tokens in A and B. Setting k to zero gives the usual logratio (which may be infinite). k is given by the laplace argument and defaults to one, meaning one fictitious occurrence is added to each corpus. Doing so prevents infinite values but has little effect when the number of occurrences is large.
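
The effect of the correction can be sketched as follows. logratio_k is a hypothetical helper written for this illustration, with k playing the role of the laplace argument; it is not the package's implementation.

```r
# Hypothetical helper, not part of keyperm: logratio with Laplace correction k.
logratio_k <- function(o11, o12, c1, c2, k = 1) {
  log2(((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)))
}

# Term absent from corpus B: infinite without correction, finite with k = 1
lr_sparse_raw <- logratio_k(5, 0, 1000, 1000, k = 0)
lr_sparse_cor <- logratio_k(5, 0, 1000, 1000, k = 1)

# Frequent term: the correction changes the score only slightly
lr_large_raw <- logratio_k(5000, 2500, 1e6, 1e6, k = 0)
lr_large_cor <- logratio_k(5000, 2500, 1e6, 1e6, k = 1)

c(lr_sparse_raw, lr_sparse_cor, lr_large_raw, lr_large_cor)
```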

ratio is the same as logratio but omits the logarithm:

ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k))

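Because the logarithm is strictly increasing, ratio orders terms and permutations exactly as logratio does. This leads to the same p-values but is faster to compute.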
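
A small sketch of why the two measures are interchangeable for ranking purposes. The counts below are made up for illustration, and the variable names are not taken from the package.

```r
# Made-up counts for four terms in corpora A and B
o11 <- c(30, 5, 0, 12)
o12 <- c(10, 0, 7, 12)
c1 <- 1000
c2 <- 1500
k <- 1  # Laplace correction

ratio <- ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k))
logratio <- log2(ratio)

# log2 is strictly increasing, so both scores rank the terms identically
same_order <- identical(order(ratio), order(logratio))
# ratio > 1 corresponds to logratio > 0 (relatively more frequent in A)
same_direction <- all((ratio > 1) == (logratio > 0))
c(same_order, same_direction)
```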

Value

A numeric vector of the scores, one for each term. Terms are stored in the names attribute.


[Package keyperm version 0.1.1 Index]