R: Calculate the permutation distribution for a keyness measure

keyperm {keyperm}

R Documentation

Calculate the permutation distribution for a keyness measure

Description

Calculate the permutation distribution of a given keyness measure for each term by shuffling the corpus labels. The number of documents per corpus is kept constant.

Usage

keyperm(ifl, observed, type = "llr", laplace = 1, output = "counts", nperm)

Arguments

`ifl`	Indexed frequency list as generated by `create_ifl()`.
`observed`	The vector of observed values of the keyness scores as generated by `keyness_scores()`.
`type`	The type of keyness measure. One of `llr`, `chisq`, `diff`, `logratio` or `ratio`. See details.
`laplace`	Parameter of the Laplace correction. Only relevant for `type = "ratio"` and `type = "logratio"`. See details.
`output`	The type of output. For `output = "full"` a matrix with all generated scores is returned, for `output = "counts"` a matrix with three columns counting the number of permutations for which the score is strictly smaller than, equal to or strictly larger than the observed value.
`nperm`	The number of permutations to generate.

Details

While keyness scores are usually judged by reference to a limiting null distribution under a token-by-token sampling model, this implementation approximates the null distribution under a document-by-document sampling model. The permutation distribution of a given keyness measure is calculated for each term by repeatedly shuffling the corpus labels of the documents. The number of documents per corpus is kept constant.
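The shuffling scheme described above can be sketched in base R for a single term. This is an illustration only, not the package's implementation: the toy data frame `docs`, the helper `diff_score()` and all numbers are hypothetical, and the `diff` measure is used because it is the simplest to write down.

```r
# Hypothetical toy data: one row per document.
set.seed(1)
docs <- data.frame(
  corpus = rep(c("A", "B"), times = c(4, 6)),               # 4 documents in A, 6 in B
  term   = c(5, 0, 7, 3, 1, 0, 2, 0, 1, 0),                 # occurrences of the term
  tokens = c(100, 80, 120, 90, 110, 95, 105, 85, 100, 90)   # document lengths
)

# Difference of relative frequencies for a given assignment of corpus labels.
diff_score <- function(labels, term, tokens) {
  in_a <- labels == "A"
  sum(term[in_a]) / sum(tokens[in_a]) - sum(term[!in_a]) / sum(tokens[!in_a])
}

observed <- diff_score(docs$corpus, docs$term, docs$tokens)

# sample() permutes the label vector, so each permutation still has
# 4 documents labelled "A" and 6 labelled "B".
nperm <- 1000
perm_scores <- replicate(
  nperm,
  diff_score(sample(docs$corpus), docs$term, docs$tokens)
)

# Tally the permutation scores relative to the observed score.
counts <- c(
  smaller = sum(perm_scores < observed),
  equal   = sum(perm_scores == observed),
  larger  = sum(perm_scores > observed)
)
```

Whole documents are moved between corpora, so the clustering of a term within individual documents is preserved in every permutation.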

Currently, the following types of scores are supported:

llr: The log-likelihood ratio
chisq: The Chi-Square-Statistic
diff: Difference of relative frequencies
logratio: Binary logarithm of the ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.
ratio: Ratio of the relative frequencies, possibly using a Laplace correction to avoid infinite values.

llr and chisq are the test-statistics for a two-by-two contingency table.

	corpus A	corpus B	TOTAL
term of interest	`o_{11}`	`o_{12}`	`r_{1}`
other tokens	`o_{21}`	`o_{22}`	`r_{2}`
TOTAL	`c_{1}`	`c_{2}`	N

Both measure deviations from equal proportions but do not indicate the direction. For llr, the correct version using terms for all four fields of the table is used, not the version using only two terms that is sometimes used in corpus linguistics:

llr = 2 * (o11 * log(o11/e11) + o12 * log(o12/e12) + o21 * log(o21/e21) + o22 * log(o22/e22))

where oij * log(oij/eij) = 0 if oij = 0.
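As a minimal sketch of the four-term statistic, the following base R code computes it in its usual non-negative form, 2 * sum(o * log(o / e)), including the convention for empty cells. The counts are hypothetical, and the expected counts are assumed to be the standard e_ij = r_i * c_j / N.

```r
# Hypothetical observed counts: rows = term of interest / other tokens,
# columns = corpus A / corpus B (matrix() fills column by column).
o <- matrix(c(15, 375, 4, 581), nrow = 2,
            dimnames = list(c("term", "other"), c("A", "B")))

# Expected counts under independence / homogeneity: e_ij = r_i * c_j / N.
e <- outer(rowSums(o), colSums(o)) / sum(o)

# Each cell contributes o * log(o / e); cells with o = 0 contribute 0.
llr_terms <- ifelse(o > 0, o * log(o / e), 0)
llr <- 2 * sum(llr_terms)
```

The `ifelse()` guard is what implements the rule that empty cells contribute nothing; without it, `0 * log(0)` evaluates to `NaN` in R.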

chisq is the usual Chi-Square statistic for a test of independence / homogeneity:

chisq = (o11 - e11)^2/e11 + (o12 - e12)^2/e12 + (o21 - e21)^2/e21 + (o22 - e22)^2/e22
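The formula can be checked against base R. The sketch below uses hypothetical counts, assumes expected counts e_ij = r_i * c_j / N, and compares the hand-computed value with `chisq.test()` from the stats package, with the continuity correction switched off so that both compute the same statistic.

```r
# Hypothetical observed counts: rows = term / other tokens, columns = corpus A / B.
o <- matrix(c(15, 375, 4, 581), nrow = 2)

# Expected counts under independence / homogeneity.
e <- outer(rowSums(o), colSums(o)) / sum(o)

# Sum of (o - e)^2 / e over the four cells.
chisq <- sum((o - e)^2 / e)

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by
# default; correct = FALSE gives the uncorrected statistic.
reference <- unname(chisq.test(o, correct = FALSE)$statistic)
```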

Both llr and chisq asymptotically follow a Chi-Square distribution with 1 degree of freedom if the null hypothesis of equal frequencies in both populations is true and the corpora are drawn iid token-by-token. In contrast, the p-values calculated here are based on a document-by-document sampling model, which is arguably more realistic in many cases.

Here, oij are the observed counts as given above and eij are the corresponding expected values under an independence / homogeneity assumption, i.e. eij = ri * cj / N.
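For comparison with the permutation approach, the asymptotic p-value under the token-by-token model is the upper tail of the Chi-Square distribution with 1 degree of freedom. A minimal sketch with a hypothetical value of the statistic:

```r
# Hypothetical value of llr or chisq for one term.
stat <- 6.2

# Upper-tail probability of the Chi-Square distribution with df = 1.
p_asymptotic <- pchisq(stat, df = 1, lower.tail = FALSE)
```

This value relies on tokens being drawn independently; the permutation distribution replaces that assumption with exchangeability of whole documents.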

diff and logratio are measures of the effect size, but using the permutation approach implemented here a p-value can be calculated as well. Both indicate the direction of the effect, and can be used for one- or two-sided tests.

diff = o11 / c1 - o12 / c2

logratio is based on the ratio of the relative frequencies and would be infinite when a term does not occur in one of the two corpora, irrespective of the number of occurrences in the other corpus. Hence, we use a Laplace correction that adds a (not necessarily integer) number k of fictitious occurrences to both corpora:

logratio = log2( ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)) )

where o11 and o12 are the numbers of occurrences of the term of interest in corpora A and B, and c1 and c2 are the total numbers of tokens in A and B. Setting k to zero corresponds to the usual logratio (which may be infinite). k is given by the laplace argument and defaults to one, meaning that one fictitious occurrence is added to each corpus. Doing so prevents infinite values but has little effect when the number of occurrences is large.
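The effect of the correction can be seen in a short sketch. The helper `logratio()` below is a hypothetical transcription of the formula above, not a function exported by the package, and the counts are made up.

```r
# Hypothetical helper implementing the formula above; k is the Laplace parameter.
logratio <- function(o11, o12, c1, c2, k = 1) {
  log2(((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k)))
}

# Term absent from corpus B: infinite without the correction, finite with it.
lr_uncorrected <- logratio(15, 0, 390, 585, k = 0)   # Inf
lr_corrected   <- logratio(15, 0, 390, 585, k = 1)

# With large counts the correction changes the score only slightly.
lr_large_k0 <- logratio(1500, 400, 39000, 58500, k = 0)
lr_large_k1 <- logratio(1500, 400, 39000, 58500, k = 1)
```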

ratio is the same as logratio but omits the logarithm:

ratio = ((o11 + k) / (c1 + k)) / ((o12 + k) / (c2 + k))

This leads to the same p-values but is faster to compute.
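The reason is that log2 is strictly increasing, so it preserves the ordering of the permutation values relative to the observed value. A small sketch with simulated (hypothetical) ratio values illustrates this:

```r
# Hypothetical permutation values of the ratio measure (positive numbers)
# and a hypothetical observed ratio.
set.seed(42)
ratio_perm <- rlnorm(1000)
ratio_obs  <- 2.5

# Helper that counts values smaller than, equal to and larger than the observed one.
tally <- function(perm, obs) {
  c(smaller = sum(perm < obs), equal = sum(perm == obs), larger = sum(perm > obs))
}

counts_ratio    <- tally(ratio_perm, ratio_obs)
counts_logratio <- tally(log2(ratio_perm), log2(ratio_obs))

# The monotone transformation leaves the tallies unchanged.
same_counts <- identical(counts_ratio, counts_logratio)
```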

Value

A numeric matrix with number of rows equal to the number of terms. The columns contain either all permutation values of the keyness score (output = "full") or the number of permutations for which the score is strictly smaller than, equal to or strictly larger than the observed value (output = "counts").
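As a hedged illustration of how a three-column count matrix of this kind can be turned into a p-value, the sketch below computes a one-sided (upper-tail) permutation p-value with the common add-one convention. The matrix, its column names and the formula are assumptions made for this example and are not taken from the package.

```r
# Hypothetical count matrix for two terms and 1000 permutations each.
counts <- matrix(c(990, 2,   8,
                   400, 0, 600),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("term1", "term2"),
                                 c("smaller", "equal", "larger")))

# Number of permutations per term.
nperm <- rowSums(counts)

# Upper-tail p-value: share of permutations at least as large as the
# observed score, counting the observed labelling itself once.
p_upper <- (counts[, "larger"] + counts[, "equal"] + 1) / (nperm + 1)
```

Keeping the "equal" count separate leaves the treatment of ties to the user, which matters for discrete scores where ties with the observed value are common.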

[Package keyperm version 0.1.1 Index]