assoc_scores {mclm}    R Documentation

Association scores used in collocation analysis and keyword analysis

Description

assoc_scores and assoc_abcd take as their arguments co-occurrence frequencies of a number of items and return a range of association scores used in collocation analysis, collostruction analysis and keyword analysis.

Usage

assoc_scores(
  x,
  y = NULL,
  min_freq = 3,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)

assoc_abcd(
  a,
  b,
  c,
  d,
  types = NULL,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)

Arguments

x

Either an object of class freqlist or an object of class cooc_info.

If x is a freqlist, it is interpreted as the target frequency list (i.e. the list with the frequency of items in the target context) and y must be a freqlist with the frequency of items in the reference context.

If x is an object of class cooc_info instead, it is interpreted as containing target frequency information, reference frequency information and corpus size information.

y

An object of class freqlist with the frequencies of the reference context if x is also a freqlist. If x is an object of class cooc_info, this argument is ignored.

min_freq

Minimum value for a[[i]] (or for the frequency of an item in the target frequency list) needed for its corresponding item to be included in the output.

measures

Character vector containing the association measures (or related quantities) for which scores are requested. Supported measure names (and related quantities) are described in Value below.

If measures is NULL, it is interpreted as short for the default selection, i.e. c("exp_a", "DP_rows", "RR_rows", "OR", "MS", "Dice", "PMI", "chi2_signed", "G_signed", "t", "fisher").

If measures is "ALL", all supported measures are calculated (but not necessarily all the variants; see with_variants).

with_variants

Logical. Whether, for the requested measures, all variants should be included in the output (TRUE) or only the main version (FALSE). See also p_fisher_2.

show_dots

Logical. Whether a dot should be shown in console each time calculations for a measure are finished.

p_fisher_2

Logical. Only relevant if "fisher" is included in measures. If TRUE, the p-value for a two-sided test (testing for either attraction or repulsion) is also calculated. By default, only the (computationally less demanding) p-value for a one-sided test is calculated. See Value for more details.

haldane

Logical. Should the Haldane-Anscombe correction be used? (See the Details section.)

If haldane is TRUE, and there is at least one zero frequency in a contingency table, the correction is used for all measures calculated for that table, not just for measures that need this to be done.

small_pos

Alternative (but sometimes inferior) approach to dealing with zero frequencies, compared to haldane. The argument small_pos only applies when haldane is set to FALSE. (See the Details section.)

If haldane is FALSE, and there is at least one zero frequency in a contingency table, adding small positive values to the zero frequency cells is done systematically for all measures calculated for that table, not just for measures that need this to be done.

a

Numeric vector expressing how many times some tested item occurs in the target context. More specifically, a[[i]], with i an integer, expresses how many times the i-th tested item occurs in the target context.

b

Numeric vector expressing how many times items other than the tested item occur in the target context. More specifically, b[[i]], with i an integer, expresses how many times items other than the i-th tested item occur in the target context.

c

Numeric vector expressing how many times some tested item occurs in the reference context. More specifically, c[[i]], with i an integer, expresses how many times the i-th tested item occurs in the reference context.

d

Numeric vector expressing how many times items other than the tested item occur in the reference context. More specifically, d[[i]], with i an integer, expresses how many times items other than the i-th tested item occur in the reference context.

types

A character vector containing the names of the linguistic items for which the association scores are to be calculated, or NULL. If NULL, assoc_abcd() creates dummy types such as "t001", "t002", etc.

Details

Input and output

assoc_scores() takes as its arguments a target frequency list and a reference frequency list (either as two freqlist objects or as a cooc_info object) and returns a number of popular measures expressing, for (almost) every item in either one of these lists, the extent to which the item is attracted to the target context, when compared to the reference context. The "almost" is added between parentheses because, with the default settings, some items are automatically excluded from the output (see min_freq).

assoc_abcd() takes as its arguments four vectors a, b, c, and d, of equal length. Each tuple of values (a[i], b[i], c[i], d[i]), with i some integer number between 1 and the length of the vectors, is assumed to represent the four numbers a, b, c, d in a contingency table of the type:

                    tested item   any other item   total
target context      a             b                m
reference context   c             d                n
total               k             l                N

In the above table m, n, k, l and N are marginal frequencies. More specifically, m = a + b, n = c + d, k = a + c, l = b + d and N = m + n.
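
As a minimal sketch in base R (no mclm functions involved), the marginal frequencies can be derived from one set of cell counts as follows. The counts are the first item from the Examples section; the formula for the expected frequency of a is the standard one under independence and is only assumed, not confirmed, to correspond to the "exp_a" column.

```r
# Cell counts for one item (first call in the Examples section).
# Note: naming a variable `c` is legal in R; calls to c() still find the function.
a <- 10; b <- 200; c <- 100; d <- 300

m <- a + b   # row total: target context
n <- c + d   # row total: reference context
k <- a + c   # column total: tested item
l <- b + d   # column total: any other item
N <- m + n   # grand total

# Expected frequency of cell a under independence (standard formula;
# assumed, not confirmed, to match the "exp_a" column).
exp_a <- m * k / N

# Lay the values out like the table above.
tab <- matrix(c(a, b, m,
                c, d, n,
                k, l, N),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("target context", "reference context", "total"),
                              c("tested item", "any other item", "total")))
print(tab)
exp_a
```

Here the observed a (10) lies well below its expected value, which points to repulsion rather than attraction for this item.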

Dealing with zeros

Several of the association measures break down when one or more of the values a, b, c, and d are zero (for instance, because this would lead to division by zero or taking the log of zero). This can be dealt with in different ways, such as the Haldane-Anscombe correction.

Strictly speaking, the Haldane-Anscombe correction applies specifically to (log) odds ratios for two-by-two tables. It boils down to adding 0.5 to each of the four values a, b, c, and d in every two-by-two contingency table for which the original values would not allow us to calculate the (log) odds ratio, which happens when one (or more than one) of the four cells is zero. With the Haldane-Anscombe correction, the (log) odds ratio is then calculated on the basis of these 'corrected' values for a, b, c, and d.

However, because other measures that do not compute (log) odds ratios might also break down when some value is zero, all measures will be computed on the 'corrected' contingency matrix.

If the haldane argument is set to FALSE, division by zero or taking the log of zero is avoided by systematically adding a small positive value to all zero values for a, b, c, and d. The argument small_pos determines which small positive value is added in such cases. Its default value is 0.00001.
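
The two strategies can be compared with a small base R sketch. The helpers fix_zeros() and odds_ratio() are hypothetical names introduced for illustration; they follow the description above but are not the package's internal code, and the cell counts are invented.

```r
# Hypothetical helper (not part of mclm): apply one of the two
# zero-handling strategies to a single two-by-two table.
fix_zeros <- function(cells, haldane = TRUE, small_pos = 1e-05) {
  if (!any(cells == 0)) return(cells)   # no zero cell: leave the table untouched
  if (haldane) {
    cells + 0.5                         # Haldane-Anscombe: add 0.5 to all four cells
  } else {
    cells[cells == 0] <- small_pos      # otherwise: only the zero cells change
    cells
  }
}

# Hypothetical helper: odds ratio (a * d) / (b * c).
odds_ratio <- function(cells) {
  unname((cells["a"] * cells["d"]) / (cells["b"] * cells["c"]))
}

# Invented counts with c = 0, so the uncorrected odds ratio is infinite.
cells <- c(a = 12, b = 300, c = 0, d = 5000)

or_raw     <- odds_ratio(cells)
or_haldane <- odds_ratio(fix_zeros(cells, haldane = TRUE))
or_small   <- odds_ratio(fix_zeros(cells, haldane = FALSE))

c(raw = or_raw, haldane = or_haldane, small_pos = or_small)
```

In this sketch the small_pos variant yields a finite but extremely large odds ratio that depends directly on the chosen constant, which illustrates why that approach is described as sometimes inferior.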

Value

An object of class assoc_scores. This is a kind of data frame whose rows are all items from either the target frequency list or the reference frequency list with a frequency of at least min_freq in the target list, and whose columns are a range of measures that express the extent to which the items are attracted to the target context (when compared to the reference context). Some columns don't contain actual measures but rather additional information that is useful for interpreting other measures.

Possible columns

The following sections describe the (possible) columns in the output. All of these measures are reported if measures is set to "ALL". Alternatively, each measure can be requested by specifying its name in a character vector given to the measures argument. Exceptions are described in the sections below.

Observed and expected frequencies

Effect size measures

Some of these measures are based on proportions and can therefore be computed either on the rows or on the columns of the contingency table. Each measure can be requested on its own, but pairs of measures can also be requested with the first part of their name, as indicated in their corresponding descriptions.

Other measures use the contingency table in a different way and therefore don't have a complementary row/column pair. In order to retrieve these columns, if measures is not "ALL", their name must be in the measures vector. Some of them are included by default, i.e. if measures is NULL.

Strength of evidence measures

The first measures in this section tend to come in triples: a test statistic, its p-value (preceded by p_) and its signed version (followed by _signed). The test statistics indicate evidence of either attraction or repulsion. Thus, in order to indicate the direction of the relationship, a negative sign is added in the "signed" version when a/k < c/l.

In each of these cases, the name of the main measure (e.g. "chi2") and/or its signed counterpart (e.g. "chi2_signed") must be in the measures argument, or measures must be "ALL", for the columns to be included in the output. If the main measure is requested, the signed counterpart will also be included, but if only the signed counterpart is requested, the non-signed version will be excluded. For the p-value to be retrieved, either the main measure or its signed version must be requested and, additionally, the with_variants argument must be set to TRUE.
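
The idea of such a triple (statistic, p-value, signed version) can be sketched in base R for the chi-squared case. The helper signed_chi2() is hypothetical and makes two assumptions that the text above does not confirm for the package: it uses the Pearson statistic without continuity correction, and it marks repulsion by checking whether a falls below its expected frequency m * k / N (equivalent to a * d < b * c).

```r
# Hypothetical helper (not part of mclm): Pearson chi-squared statistic for
# two-by-two tables, its p-value, and a signed version of the statistic.
signed_chi2 <- function(a, b, c, d) {
  m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n
  chi2  <- N * (a * d - b * c)^2 / (m * n * k * l)   # no continuity correction
  p     <- pchisq(chi2, df = 1, lower.tail = FALSE)  # upper-tail p-value, 1 df
  exp_a <- m * k / N                                 # expected a under independence
  # Assumption: a negative sign marks repulsion, i.e. observed a below expected a.
  chi2_signed <- ifelse(a < exp_a, -chi2, chi2)
  data.frame(chi2 = chi2, p_chi2 = p, chi2_signed = chi2_signed)
}

# Vectors taken from the Examples section.
res <- signed_chi2(a = c(10, 30, 15, 1),
                   b = c(200, 1000, 5000, 300),
                   c = c(100, 14, 16, 4),
                   d = c(300, 5000, 10000, 6000))
res
```

The unsigned statistic alone cannot distinguish the first item (repelled by the target context) from the other three (attracted to it); the signed column makes that difference visible.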

The final two groups of measures take a different shape. The _as_chisq1 columns compute qchisq(1 - p, 1), with p being the p-value they transform, i.e. the value that a chi-squared distribution with one degree of freedom exceeds with probability p (see p_to_chisq1()).
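
A short base R illustration of this transformation follows. The p-values are invented for the example, and the upper-tail variant is shown only as an equivalent formulation; whether p_to_chisq1() uses it internally is not stated here.

```r
# Illustrative p-values (not taken from any real data).
p <- c(0.5, 0.05, 0.001)

# The transformation as described: quantile at 1 - p, one degree of freedom.
as_chisq1 <- qchisq(1 - p, df = 1)

# Equivalent upper-tail formulation, which avoids computing 1 - p
# and is therefore numerically safer for very small p.
as_chisq1_upper <- qchisq(p, df = 1, lower.tail = FALSE)

# Converting back recovers the original p-values.
p_back <- pchisq(as_chisq1, df = 1, lower.tail = FALSE)

data.frame(p = p, as_chisq1 = as_chisq1, p_back = p_back)
```

A p-value of 0.05 maps to roughly 3.84, the familiar critical value of the chi-squared distribution with one degree of freedom, so the transformed columns can be read on the same scale as a chi-squared statistic.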

Properties of the class

An object of class assoc_scores has:

An object of this class can be saved to file with write_assoc() and read with read_assoc().

Examples

assoc_abcd(10,  200, 100,  300, types = "four")
assoc_abcd(30, 1000,  14, 5000, types = "fictitious")
assoc_abcd(15, 5000,  16, 1000, types = "toy")
assoc_abcd( 1,  300,   4, 6000, types = "examples")

a <- c(10,    30,    15,    1)
b <- c(200, 1000,  5000,  300)
c <- c(100,   14,    16,    4)
d <- c(300, 5000, 10000, 6000)
types <- c("four", "fictitious", "toy", "examples")
(scores <- assoc_abcd(a, b, c, d, types = types))

as_data_frame(scores)
as_tibble(scores)

print(scores, sort_order = "PMI")
print(scores, sort_order = "alpha")
print(scores, sort_order = "none")
print(scores, sort_order = "nonsense")

print(scores, sort_order = "PMI",
      keep_cols = c("a", "exp_a", "PMI", "G_signed"))
print(scores, sort_order = "PMI",
      keep_cols = c("a", "b", "c", "d", "exp_a", "G_signed"))
print(scores, sort_order = "PMI",
      drop_cols = c("a", "b", "c", "d", "exp_a", "G_signed",
                    "RR_rows", "chi2_signed", "t"))

[Package mclm version 0.2.7 Index]