keyperm {keyperm} | R Documentation |
Calculate the permutation distribution for a keyness measure
Description
Calculate the permutation distribution of a given keyness measure for each term by shuffling the corpus labels. The number of documents per corpus is kept constant.
Usage
keyperm(ifl, observed, type = "llr", laplace = 1, output = "counts", nperm)
Arguments
ifl |
Indexed frequency list as generated by |
observed |
The vector of observed values of the keyness scores as generated by |
type |
The type of keyness measure. One of |
laplace |
Parameter of laplace correction. Only relevant for |
output |
The type of output. For |
nperm |
The number of permutations to generate. |
Details
Keyness scores are usually judged against a limiting null distribution under a token-by-token sampling model. This implementation instead approximates the null distribution under a document-by-document sampling model: the permutation distribution of a given keyness measure is calculated for each term by repeatedly shuffling the corpus labels of the documents. The number of documents per corpus is kept constant.
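The label-shuffling idea can be sketched in base R for a single term. This is only an illustration under stated assumptions, not the package's implementation: the toy counts, the helper `diff_score`, and all variable names are made up for the example, and the score used is the difference of relative frequencies.

```r
# Toy data (made up): per-document counts of one term and document lengths.
set.seed(1)
term_counts <- c(5, 0, 3, 1, 0, 0, 2, 0)
doc_lengths <- c(100, 120, 90, 110, 100, 95, 105, 115)
corpus      <- c("A", "A", "A", "A", "B", "B", "B", "B")

# Hypothetical helper: difference of relative frequencies between A and B
# for a given assignment of documents to corpora.
diff_score <- function(corpus) {
  in_a <- corpus == "A"
  sum(term_counts[in_a]) / sum(doc_lengths[in_a]) -
    sum(term_counts[!in_a]) / sum(doc_lengths[!in_a])
}

observed <- diff_score(corpus)

# sample() permutes the label vector, so the number of documents per corpus
# stays constant in every permutation.
nperm <- 1000
perm_scores <- replicate(nperm, diff_score(sample(corpus)))

# Compare the permuted scores with the observed one.
counts <- c(less    = sum(perm_scores <  observed),
            equal   = sum(perm_scores == observed),
            greater = sum(perm_scores >  observed))
counts
```

Because whole documents are reassigned, a term concentrated in a few documents produces a wider permutation distribution than a token-by-token model would suggest.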
Currently, the following types of scores are supported:
llr
The log-likelihood ratio
chisq
The Chi-Square-Statistic
diff
Difference of relative frequencies
logratio
Binary logarithm of the ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.
ratio
Ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.
llr and chisq are the test statistics for a two-by-two contingency table:
                 | corpus A | corpus B | TOTAL |
term of interest | o_{11}   | o_{12}   | r_{1} |
other tokens     | o_{21}   | o_{22}   | r_{2} |
TOTAL            | c_{1}    | c_{2}    | N     |
Both measure deviations from equal proportions but do not indicate the direction.
For llr, the correct version using terms for all four cells of the table is used, not the version using only two terms that is sometimes used in corpus linguistics:
llr = 2 * (o11 * log(o11/e11) + o12 * log(o12/e12) +
           o21 * log(o21/e21) + o22 * log(o22/e22))
where oij * log(oij/eij) = 0 if oij = 0.
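A minimal base R sketch of the four-term statistic follows. It assumes the conventional non-negative form 2 * sum(o * log(o / e)) and expected counts computed from the table margins. The function name `llr_2x2` and the example counts are made up for illustration and are not part of the package.

```r
# Log-likelihood ratio for a 2x2 table of observed counts
# (rows: term of interest / other tokens, columns: corpus A / corpus B).
llr_2x2 <- function(o) {
  # Expected counts under independence: e_ij = r_i * c_j / N.
  e <- outer(rowSums(o), colSums(o)) / sum(o)
  # Convention: a cell with o_ij = 0 contributes 0 to the sum.
  terms <- ifelse(o == 0, 0, o * log(o / e))
  2 * sum(terms)
}

# Made-up counts: the term occurs 30 times in A (1000 tokens)
# and 10 times in B (2000 tokens).
o <- matrix(c(30, 10,
              970, 1990), nrow = 2, byrow = TRUE)
llr_2x2(o)

# A zero cell does not produce NaN or infinite values.
o_zero <- matrix(c(0, 10,
                   1000, 1990), nrow = 2, byrow = TRUE)
llr_2x2(o_zero)
```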
chisq is the usual Chi-Square statistic for a test of independence / homogeneity:
chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 +
(o21 - e21)^2/e21 + (o22 - e22)^2/e22
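As a sketch, the same statistic can be computed by hand in base R and cross-checked against `chisq.test()` without continuity correction. The function name `chisq_2x2` and the example counts are assumptions made for illustration.

```r
# Pearson Chi-Square statistic for a 2x2 table of observed counts.
chisq_2x2 <- function(o) {
  # Expected counts under independence: e_ij = r_i * c_j / N.
  e <- outer(rowSums(o), colSums(o)) / sum(o)
  sum((o - e)^2 / e)
}

# Made-up counts: rows are term of interest / other tokens,
# columns are corpus A / corpus B.
o <- matrix(c(30, 10,
              970, 1990), nrow = 2, byrow = TRUE)

by_hand  <- chisq_2x2(o)
# correct = FALSE switches off Yates' continuity correction,
# so the result is the plain Pearson statistic.
builtin  <- unname(chisq.test(o, correct = FALSE)$statistic)
all.equal(by_hand, builtin)
```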
Both llr and chisq asymptotically follow a Chi-Square distribution with 1 degree of freedom if the null hypothesis of equal frequencies in both populations is true and the corpora are drawn iid token-by-token. In contrast, the p-values calculated here are based on a document-by-document sampling model, which is arguably more realistic in many cases.
Here, oij are the observed counts as given above and eij are the corresponding expected values under an independence / homogeneity assumption.
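For comparison, the asymptotic p-value under the token-by-token model is an upper tail probability of the Chi-Square distribution with 1 degree of freedom. The snippet below is a sketch with a hypothetical score value. It shows only the asymptotic reference, not the permutation-based p-value this function is meant to support.

```r
# Hypothetical observed llr or chisq score for one term.
stat <- 6.63

# Upper tail probability of the Chi-Square distribution with df = 1.
p_asymptotic <- pchisq(stat, df = 1, lower.tail = FALSE)
p_asymptotic
```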
diff and logratio are measures of the effect size, but with the permutation approach implemented here a p-value can be calculated as well. Both indicate the direction of the effect and can be used for one- or two-sided tests.
diff = o11 / c1 - o12 / c2
logratio is based on a ratio of ratios and would be infinite when a term does not occur in one of the two corpora, irrespective of the number of occurrences in the other corpus. Hence, we use a Laplace correction, adding a (not necessarily integer) number k of fictitious occurrences to both corpora:
logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) )
where o11 and o12 are the numbers of occurrences of the term of interest in corpora A and B, and c1 and c2 are the total numbers of tokens in A and B.
Setting k to zero corresponds to the usual logratio (which may be infinite). k is given by the laplace argument and defaults to one, meaning one fictitious occurrence is added to each corpus. Doing so prevents infinite values but has little effect when the number of occurrences is large.
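The effect of the correction can be sketched directly from the formula above. The function name `logratio_k` and the counts are made up for illustration. Only the formula itself comes from this page.

```r
# logratio with Laplace correction k, following the formula above.
logratio_k <- function(o11, o12, c1, c2, k = 1) {
  log2(((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)))
}

# Term absent from corpus B: infinite without correction, finite with k = 1.
lr_raw       <- logratio_k(5, 0, 1000, 2000, k = 0)
lr_corrected <- logratio_k(5, 0, 1000, 2000, k = 1)

# With large counts the correction changes the score only slightly.
lr_large_raw       <- logratio_k(500, 200, 100000, 200000, k = 0)
lr_large_corrected <- logratio_k(500, 200, 100000, 200000, k = 1)

c(lr_raw, lr_corrected, lr_large_raw, lr_large_corrected)
```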
ratio is the same as logratio but omits the logarithm:
ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k))
This leads to the same p-values but is faster to compute.
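The p-values agree because the logarithm is strictly increasing, so comparing permuted scores with the observed score gives the same result on either scale. The following sketch uses randomly generated stand-in values rather than real permutation output. All names and numbers are assumptions for illustration.

```r
set.seed(42)
# Stand-in for permuted ratio scores (positive, made up).
perm_ratio <- exp(rnorm(1000))
obs_ratio  <- 1.5

# Hypothetical helper: tabulate scores below, equal to, and above the
# observed value.
compare <- function(perm, obs) {
  c(less = sum(perm < obs), equal = sum(perm == obs), greater = sum(perm > obs))
}

counts_ratio    <- compare(perm_ratio, obs_ratio)
counts_logratio <- compare(log2(perm_ratio), log2(obs_ratio))

# The monotone transformation leaves the comparison counts unchanged.
identical(counts_ratio, counts_logratio)
```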
Value
A numeric matrix with one row per term. The columns contain either all permutation values of the keyness score (output = "full") or the number of permutations for which the score is strictly smaller than, equal to, or strictly larger than the observed value (output = "counts").
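As a sketch of how such counts might be turned into a p-value, the snippet below applies one common convention: ties count as at least as extreme, and the observed value is included as one additional permutation. The element names, the numbers, and the convention itself are assumptions for illustration and are not necessarily what the package does.

```r
# Hypothetical counts for one term: permutations with a score smaller than,
# equal to, and larger than the observed value.
counts <- c(less = 940, equal = 10, greater = 50)
nperm  <- sum(counts)

# One-sided p-value for unusually large scores.
p_greater <- (counts[["greater"]] + counts[["equal"]] + 1) / (nperm + 1)

# One-sided p-value for unusually small scores.
p_less <- (counts[["less"]] + counts[["equal"]] + 1) / (nperm + 1)

c(p_greater = p_greater, p_less = p_less)
```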