| assoc_scores {mclm} | R Documentation |
Association scores used in collocation analysis and keyword analysis
Description
assoc_scores and assoc_abcd take as their arguments co-occurrence
frequencies of a number of items and return a range of association scores used
in collocation analysis, collostruction analysis and keyword analysis.
Usage
assoc_scores(
  x,
  y = NULL,
  min_freq = 3,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)
assoc_abcd(
  a,
  b,
  c,
  d,
  types = NULL,
  measures = NULL,
  with_variants = FALSE,
  show_dots = FALSE,
  p_fisher_2 = FALSE,
  haldane = TRUE,
  small_pos = 1e-05
)
Arguments
x |
Either an object of class If If |
y |
An object of class |
min_freq |
Minimum value for |
measures |
Character vector containing the association measures (or related
quantities) for which scores are requested. Supported measure names (and
related quantities) are described in If If |
with_variants |
Logical. Whether, for the requested |
show_dots |
Logical. Whether a dot should be shown in console each time calculations for a measure are finished. |
p_fisher_2 |
Logical. Only relevant if |
haldane |
Logical. Should the Haldane-Anscombe correction be used? (See the Details section.) If |
small_pos |
Alternative (but sometimes inferior) approach to dealing with
zero frequencies, compared to If |
a |
Numeric vector expressing how many times some tested item
occurs in the target context.
More specifically, |
b |
Numeric vector expressing how many times items other than the tested
item occur in the target context.
More specifically, |
c |
Numeric vector expressing how many times some tested
item occurs in the reference context.
More specifically, |
d |
Numeric vector expressing how many times items other than the tested
item occur in the reference context.
More specifically, |
types |
A character vector containing the names of the linguistic items
of which the association scores are to be calculated, or |
Details
Input and output
assoc_scores() takes as its arguments a target frequency list and a reference
frequency list (either as two freqlist objects or as a
cooc_info object) and returns a number of popular measures
expressing, for (almost) every item in either one of these lists, the extent
to which the item is attracted to the target context, when compared to the
reference context. The "almost" is added between parentheses because, with
the default settings, some items are automatically excluded from the output
(see min_freq).
assoc_abcd() takes as its arguments four vectors a, b, c, and d, of
equal length. Each tuple of values (a[i], b[i], c[i], d[i]), with i some
integer number between 1 and the length of the vectors, is assumed to represent
the four numbers a, b, c, d in a contingency table of the type:
| | tested item | any other item | total |
| target context | a | b | m |
| reference context | c | d | n |
| total | k | l | N |
In the above table m, n, k, l and N are marginal frequencies. More specifically, m = a + b, n = c + d, k = a + c, l = b + d and N = m + n.
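The marginal frequencies can be derived with plain vector arithmetic. The following is a minimal base R sketch with made-up counts; it does not call the package, and the object names simply mirror the table above.

```r
# Four toy tables, one per tested item (values are made up).
a <- c(10, 30, 15, 1)
b <- c(200, 1000, 5000, 300)
c <- c(100, 14, 16, 4)
d <- c(300, 5000, 10000, 6000)

m <- a + b   # row total: target context
n <- c + d   # row total: reference context
k <- a + c   # column total: tested item
l <- b + d   # column total: any other item
N <- m + n   # grand total

# Lay out the first tuple (a[1], b[1], c[1], d[1]) as a 2 x 2 table
# and let base R add the margins.
tab1 <- matrix(c(a[1], b[1], c[1], d[1]), nrow = 2, byrow = TRUE,
               dimnames = list(c("target", "reference"),
                               c("tested", "other")))
addmargins(tab1)
```

Because all operations are vectorized, each position i in m, n, k, l and N belongs to the table built from a[i], b[i], c[i] and d[i].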
Dealing with zeros
Several of the association measures break down when one or more of the values
a, b, c, and d are zero (for instance, because this would lead to
division by zero or taking the log of zero). This can be dealt with in different
ways, such as the Haldane-Anscombe correction.
Strictly speaking, the Haldane-Anscombe correction specifically applies to the
context of (log) odds ratios for two-by-two tables and boils down to adding
0.5 to each of the four values a, b, c, and d
in every two-by-two contingency table for which the original values
a, b, c, and d would not allow us to calculate
the (log) odds ratio, which happens when one (or more than one) of the four
cells is zero.
Using the Haldane-Anscombe correction, the (log) odds ratio is then calculated
on the basis of these 'corrected' values for a, b, c, and d.
However, because other measures that do not compute (log) odds ratios might also break down when some value is zero, all measures will be computed on the 'corrected' contingency matrix.
If the haldane argument is set to FALSE, division by zero or taking the
log of zero is avoided by systematically adding a small positive value to all
zero values for a, b, c, and d. The argument small_pos
determines which small positive value is added in such cases. Its default value is 0.00001.
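The two strategies can be illustrated with a small base R sketch. The helper correct_zeros() is hypothetical and is not part of the package; it only mirrors the behaviour described above and should not be read as the actual implementation.

```r
# Hypothetical helper (not part of mclm): returns corrected cell counts.
correct_zeros <- function(a, b, c, d, haldane = TRUE, small_pos = 1e-05) {
  cells <- cbind(a = a, b = b, c = c, d = d)
  if (haldane) {
    # Haldane-Anscombe: add 0.5 to all four cells of every table
    # that contains at least one zero.
    has_zero <- rowSums(cells == 0) > 0
    cells[has_zero, ] <- cells[has_zero, ] + 0.5
  } else {
    # Alternative: only the zero cells receive a small positive value.
    cells[cells == 0] <- small_pos
  }
  as.data.frame(cells)
}

# The second table has a zero in cell a; the first one is left untouched.
correct_zeros(a = c(5, 0), b = c(20, 30), c = c(8, 4), d = c(100, 90))
correct_zeros(a = c(5, 0), b = c(20, 30), c = c(8, 4), d = c(100, 90),
              haldane = FALSE)
```

Note the difference in scope: the first strategy shifts every cell of an affected table, whereas the second only touches the zero cells themselves.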
Value
An object of class assoc_scores. This is a kind of data frame with
as its rows all items from either the target frequency list or the reference
frequency list with a frequency larger than min_freq in the target list,
and as its columns a range of measures that express the extent to which
the items are attracted to the target context (when compared to the reference
context).
Some columns don't contain actual measures but rather additional information
that is useful for interpreting other measures.
Possible columns
The following sections describe the (possible) columns in the output. All
of these measures are reported if measures is set to "ALL". Alternatively,
each measure can be requested by specifying its name in a character vector
given to the measures argument. Exceptions are described in the sections
below.
Observed and expected frequencies
- a, b, c, d: The frequencies in cells a, b, c and d, respectively. If one of them is 0, they will be augmented by 0.5 or small_pos (see Details). These output columns are always present.
- dir: The direction of the association: 1 in case of relative attraction between the tested item and the target context (if \frac{a}{m} \ge \frac{c}{n}) and -1 in case of relative repulsion between the tested item and the target context (if \frac{a}{m} < \frac{c}{n}).
- exp_a, exp_b, exp_c, exp_d: The expected values for cells a, b, c and d, respectively. All these columns will be included if "expected" is in measures. exp_a is also one of the default measures and is therefore included if measures is NULL. The values of these columns are computed as follows:
  - exp_a = \frac{m \times k}{N}
  - exp_b = \frac{m \times l}{N}
  - exp_c = \frac{n \times k}{N}
  - exp_d = \frac{n \times l}{N}
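As an illustration, the expected frequencies and the direction can be computed by hand in base R. This is a sketch with made-up counts, not the package's own code; the cross-check relies on chisq.test() from the stats package, which stores the same expected counts.

```r
a <- 10; b <- 200; c <- 100; d <- 300
m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n

# Expected frequencies under independence of item and context.
exp_a <- m * k / N
exp_b <- m * l / N
exp_c <- n * k / N
exp_d <- n * l / N

# Direction: 1 for relative attraction, -1 for relative repulsion.
dir <- ifelse(a / m >= c / n, 1, -1)

# Cross-check against the expected counts computed by chisq.test().
observed <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)
expected <- suppressWarnings(chisq.test(observed)$expected)
all.equal(c(exp_a, exp_b, exp_c, exp_d), as.vector(t(expected)))
```

In this toy table the tested item is less frequent in the target context than expected, so dir is -1.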
Effect size measures
Some of these measures are based on proportions and can therefore be computed either on the rows or on the columns of the contingency table. Each measure can be requested on its own, but pairs of measures can also be requested with the first part of their name, as indicated in their corresponding descriptions.
- DP_rows and DP_cols: The difference of proportions, sometimes also called Delta-p (\Delta p), between rows and columns respectively. Both columns are present if "DP" is included in measures. DP_rows is also included if measures is NULL. They are calculated as follows:
  - DP_rows = \frac{a}{m} - \frac{c}{n}
  - DP_cols = \frac{a}{k} - \frac{b}{l}
- perc_DIFF_rows and perc_DIFF_cols: These measures can be seen as normalized versions of Delta-p, i.e. essentially the same differences divided by the reference proportion and multiplied by 100. They therefore express how large the difference of proportions is, relative to the reference proportion. The multiplication by 100 turns the resulting 'relative difference of proportion' into a percentage. Both columns are present if "perc_DIFF" is included in measures. They are calculated as follows:
  - perc_DIFF_rows = 100 * \frac{(a / m) - (c / n)}{c / n}
  - perc_DIFF_cols = 100 * \frac{(a / k) - (b / l)}{b / l}
- DC_rows and DC_cols: The difference coefficient can be seen as a normalized version of Delta-p, i.e. essentially dividing the difference of proportions by the sum of proportions. Both columns are present if "DC" is included in measures. They are calculated as follows:
  - DC_rows = \frac{(a / m) - (c / n)}{(a / m) + (c / n)}
  - DC_cols = \frac{(a / k) - (b / l)}{(a / k) + (b / l)}
- RR_rows and RR_cols: Relative risk for the rows and columns respectively. RR_rows then represents how large the proportion in the target context is, relative to the proportion in the reference context. Both columns are present if "RR" is included in measures. RR_rows is also included if measures is NULL. They are calculated as follows:
  - RR_rows = \frac{a / m}{c / n}
  - RR_cols = \frac{a / k}{b / l}
- LR_rows and LR_cols: The so-called 'log ratio' of the rows and columns, respectively. It can be seen as a transformed version of the relative risk, viz. its binary log. Both columns are present if "LR" is included in measures. They are calculated as follows:
  - LR_rows = \log_2\left(\frac{a / m}{c / n}\right)
  - LR_cols = \log_2\left(\frac{a / k}{b / l}\right)
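The proportion-based measures above all derive from the same pair of proportions, as the following base R sketch shows. The counts are made up and the code does not call the package; p_target and p_ref are illustrative names.

```r
a <- 10; b <- 200; c <- 100; d <- 300
m <- a + b; n <- c + d; k <- a + c; l <- b + d

p_target <- a / m   # proportion of the tested item in the target context
p_ref    <- c / n   # proportion of the tested item in the reference context

# Row-based measures.
DP_rows        <- p_target - p_ref
perc_DIFF_rows <- 100 * (p_target - p_ref) / p_ref
DC_rows        <- (p_target - p_ref) / (p_target + p_ref)
RR_rows        <- p_target / p_ref
LR_rows        <- log2(RR_rows)

# Column-based counterparts compare a / k with b / l instead.
DP_cols <- a / k - b / l
RR_cols <- (a / k) / (b / l)

round(c(DP_rows = DP_rows, perc_DIFF_rows = perc_DIFF_rows,
        DC_rows = DC_rows, RR_rows = RR_rows, LR_rows = LR_rows,
        DP_cols = DP_cols, RR_cols = RR_cols), 3)
```

One relationship worth noting is that perc_DIFF_rows equals 100 * (RR_rows - 1), so the percentage difference and the relative risk carry the same information on different scales.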
Other measures use the contingency table in a different way and therefore
don't have a complementary row/column pair. In order to retrieve these columns,
if measures is not "ALL", their name must be in the measures vector.
Some of them are included by default, i.e. if measures is NULL.
- OR: The odds ratio, which can be calculated either as \frac{a/b}{c/d} or as \frac{a/c}{b/d}. This column is present if measures is NULL.
- log_OR: The log odds ratio, which can be calculated either as \log\left(\frac{a/b}{c/d}\right) or as \log\left(\frac{a/c}{b/d}\right). In other words, it is the natural log of the odds ratio.
- MS: The minimum sensitivity, which is calculated as \min(\frac{a}{m}, \frac{a}{k}). In other words, it is either \frac{a}{m} or \frac{a}{k}, whichever is lowest. This column is present if measures is NULL.
- Jaccard: The Jaccard index, which is calculated as \frac{a}{a + b + c}. It expresses a, which is the frequency of the tested item in the target context, relative to a + b + c, i.e. the number of cases that involve the tested item, the target context, or both.
- Dice: The Dice coefficient, which is calculated as \frac{2a}{m + k}. It expresses the harmonic mean of \frac{a}{m} and \frac{a}{k}. This column is present if measures is NULL.
- logDice: An adapted version of the Dice coefficient. It is calculated as 14 + \log_2\left(\frac{2a}{m + k}\right). In other words, it is 14 plus the binary log of the Dice coefficient.
- phi: The phi coefficient (\phi), which is calculated as \frac{(a \times d) - (b \times c)}{\sqrt{m \times n \times k \times l}}.
- Q: Yule's Q, which is calculated as \frac{(a \times d) - (b \times c)}{(a \times d) + (b \times c)}.
- mu: The measure mu (\mu), which is calculated as \frac{a}{\mathrm{exp\_a}} (see exp_a).
- PMI and pos_PMI: (Positive) pointwise mutual information, which can be seen as a modification of the mu measure and is calculated as \log_2\left(\frac{a}{\mathrm{exp\_a}}\right). In pos_PMI, negative values are set to 0. The PMI column is present if measures is NULL.
- PMI2 and PMI3: Modified versions of PMI that aim to give relatively more weight to cases with relatively higher a. However, because of this modification, they are not pure effect size measures any more.
  - PMI2 = \log_2\left(\frac{a^2}{\mathrm{exp\_a}}\right)
  - PMI3 = \log_2\left(\frac{a^3}{\mathrm{exp\_a}}\right)
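A base R sketch of these measures for a single made-up table follows. It is an illustration under the standard textbook definitions (Yule's Q in particular is computed as (ad - bc) / (ad + bc)) and is not taken from the package source.

```r
a <- 10; b <- 200; c <- 100; d <- 300
m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n
exp_a <- m * k / N

OR      <- (a / b) / (c / d)        # equivalently (a / c) / (b / d)
log_OR  <- log(OR)
MS      <- pmin(a / m, a / k)       # pmin() keeps the code vectorized
Jaccard <- a / (a + b + c)
Dice    <- 2 * a / (m + k)
logDice <- 14 + log2(Dice)
phi     <- (a * d - b * c) / sqrt(m * n * k * l)
Q       <- (a * d - b * c) / (a * d + b * c)
mu      <- a / exp_a
PMI     <- log2(mu)
pos_PMI <- pmax(PMI, 0)             # negative values are set to 0
PMI2    <- log2(a^2 / exp_a)
PMI3    <- log2(a^3 / exp_a)

# Two identities that tie the measures together.
all.equal(Q, (OR - 1) / (OR + 1))
all.equal(Dice, 2 / (1 / (a / m) + 1 / (a / k)))   # harmonic mean
```

The two all.equal() calls confirm that Q is a rescaled odds ratio and that the Dice coefficient is the harmonic mean of a / m and a / k.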
Strength of evidence measures
The first measures in this section tend to come in triples: a test statistic,
its p-value (preceded by p_) and its signed version (followed by _signed).
The test statistics indicate evidence of either attraction or repulsion.
Thus, in order to indicate the direction of the relationship, a negative
sign is added in the "signed" version when \frac{a}{m} < \frac{c}{n} (the repulsion condition used for dir).
In each of these cases, the name of the main measure (e.g. "chi2")
and/or its signed counterpart (e.g. "chi2_signed") must be in the measures
argument, or measures must be "ALL", for the columns to be included in
the output. If the main measure is requested, the signed counterpart will
also be included, but if only the signed counterpart is requested, the non-signed
version will be excluded.
For the p-value to be retrieved, either the main measure or its signed version
must be requested and, additionally, the with_variants argument must be
set to TRUE.
- chi2, p_chi2 and chi2_signed: The chi-squared test statistic (\chi^2) as used in a chi-squared test of independence or in a chi-squared test of homogeneity for a two-by-two contingency table. Scores of this measure are high when there is strong evidence for attraction, but also when there is strong evidence for repulsion. The chi2_signed column is present if measures is NULL. chi2 is calculated as follows: \frac{(a-\mathrm{exp\_a})^2}{\mathrm{exp\_a}} + \frac{(b-\mathrm{exp\_b})^2}{\mathrm{exp\_b}} + \frac{(c-\mathrm{exp\_c})^2}{\mathrm{exp\_c}} + \frac{(d-\mathrm{exp\_d})^2}{\mathrm{exp\_d}}.
- chi2_Y, p_chi2_Y and chi2_Y_signed: The chi-squared test statistic (\chi^2) as used in a chi-squared test with Yates correction for a two-by-two contingency table. chi2_Y is calculated as follows: \frac{(|a-\mathrm{exp\_a}| - 0.5)^2}{\mathrm{exp\_a}} + \frac{(|b-\mathrm{exp\_b}| - 0.5)^2}{\mathrm{exp\_b}} + \frac{(|c-\mathrm{exp\_c}| - 0.5)^2}{\mathrm{exp\_c}} + \frac{(|d-\mathrm{exp\_d}| - 0.5)^2}{\mathrm{exp\_d}}.
- chi2_2T, p_chi2_2T and chi2_2T_signed: The chi-squared test statistic (\chi^2) as used in a chi-squared goodness-of-fit test applied to the first column of the contingency table. The "2T" in the name stands for 'two terms' (as opposed to chi2, which is sometimes called the 'four terms' version). chi2_2T is calculated as follows: \frac{(a-\mathrm{exp\_a})^2}{\mathrm{exp\_a}} + \frac{(c-\mathrm{exp\_c})^2}{\mathrm{exp\_c}}.
- chi2_2T_Y, p_chi2_2T_Y and chi2_2T_Y_signed: The chi-squared test statistic (\chi^2) as used in a chi-squared goodness-of-fit test with Yates correction, applied to the first column of the contingency table. chi2_2T_Y is calculated as follows: \frac{(|a-\mathrm{exp\_a}| - 0.5)^2}{\mathrm{exp\_a}} + \frac{(|c-\mathrm{exp\_c}| - 0.5)^2}{\mathrm{exp\_c}}.
- G, p_G and G_signed: G test statistic, which is also sometimes called log-likelihood ratio (LLR) and, somewhat confusingly, G-squared. This is the test statistic as used in a log-likelihood ratio test for independence or homogeneity in a two-by-two contingency table. Scores are high in case of strong evidence for attraction, but also in case of strong evidence of repulsion. The G_signed column is present if measures is NULL. G is calculated as follows: 2 \left( a \times \log(\frac{a}{\mathrm{exp\_a}}) + b \times \log(\frac{b}{\mathrm{exp\_b}}) + c \times \log(\frac{c}{\mathrm{exp\_c}}) + d \times \log(\frac{d}{\mathrm{exp\_d}}) \right).
- G_2T, p_G_2T and G_2T_signed: The test statistic used in a log-likelihood ratio test for goodness-of-fit applied to the first column of the contingency table. The "2T" stands for 'two terms'. G_2T is calculated as follows: 2 \left( a \times \log(\frac{a}{\mathrm{exp\_a}}) + c \times \log(\frac{c}{\mathrm{exp\_c}}) \right).
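The four-term statistics and their signed versions can be reproduced by hand, as in the base R sketch below. The counts are made up, the p-values assume one degree of freedom (the usual choice for a two-by-two table), and the code is an illustration rather than the package's implementation.

```r
a <- 10; b <- 200; c <- 100; d <- 300
m <- a + b; n <- c + d; k <- a + c; l <- b + d; N <- m + n
obs  <- c(a, b, c, d)
expd <- c(m * k, m * l, n * k, n * l) / N   # expected frequencies

chi2   <- sum((obs - expd)^2 / expd)
chi2_Y <- sum((abs(obs - expd) - 0.5)^2 / expd)   # Yates correction
G      <- 2 * sum(obs * log(obs / expd))

# p-values, assuming one degree of freedom.
p_chi2 <- pchisq(chi2, df = 1, lower.tail = FALSE)
p_G    <- pchisq(G, df = 1, lower.tail = FALSE)

# Signed versions: negative when the tested item is repelled.
dir         <- ifelse(a / m < c / n, -1, 1)
chi2_signed <- dir * chi2
G_signed    <- dir * G

# Cross-check against chisq.test() from the stats package.
tab <- matrix(obs, nrow = 2, byrow = TRUE)
all.equal(chi2, unname(chisq.test(tab, correct = FALSE)$statistic))
all.equal(chi2_Y, unname(chisq.test(tab, correct = TRUE)$statistic))
```

The two-term variants follow the same pattern but sum only over the first column, i.e. over the terms for a and c.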
The final two groups of measures take a different shape. The
_as_chisq1 columns compute qchisq(1 - p, 1), with p being the p-values
they are transforming, i.e. the quantile that has right-tail probability p in a \chi^2
distribution with one degree of freedom (see p_to_chisq1()).
- t, p_t_1, t_1_as_chisq1, p_t_2 and t_2_as_chisq1: The t-test statistic, used for a t-test for the proportion \frac{a}{N} in which the null hypothesis is based on \frac{k}{N}\times\frac{m}{N}. Column t is present if "t" is included in measures or if measures is "ALL" or NULL. The other four columns are present if t is requested and if, additionally, with_variants is TRUE.
  - t = \frac{ a/N - (k/N) \times (m/N) }{ \sqrt{((a/N)\times (1-a/N))/N} }
  - p_t_1 is the p-value that corresponds to t when assuming a one-tailed test that only looks at attraction; t_1_as_chisq1 is its transformation.
  - p_t_2 is the p-value that corresponds to t when assuming a two-tailed test, viz. that looks at both attraction and repulsion; t_2_as_chisq1 is its transformation.
- p_fisher_1, fisher_1_as_chisq1, p_fisher_1r, fisher_1r_as_chisq1: The p-value of a one-sided Fisher exact test. The column p_fisher_1 is present if either "fisher" or "p_fisher" are in measures or if measures is "ALL" or NULL. The other columns are present if p_fisher_1 has been requested and if, additionally, with_variants is TRUE.
  - p_fisher_1 and p_fisher_1r are the p-values of the Fisher exact test that look at attraction and repulsion respectively.
  - fisher_1_as_chisq1 and fisher_1r_as_chisq1 are their respective transformations.
- p_fisher_2 and fisher_2_as_chisq1: p-value for a two-sided Fisher exact test, viz. looking at both attraction and repulsion. p_fisher_2 returns the p-value and fisher_2_as_chisq1 is its transformation. The p_fisher_2 column is present if either "fisher" or "p_fisher_1" are in measures or if measures is "ALL" or NULL and if, additionally, p_fisher_2 is TRUE. fisher_2_as_chisq1 is present if p_fisher_2 was requested and, additionally, with_variants is TRUE.
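The following base R sketch illustrates the t statistic, the three Fisher exact tests and the transformation to the chi-squared scale. The counts are made up; as_chisq1() is a hypothetical stand-in that applies the qchisq(1 - p, 1) formula quoted above, and fisher.test() from the stats package is used here as an assumption about how such p-values can be obtained, not as a description of the package internals.

```r
a <- 10; b <- 200; c <- 100; d <- 300
m <- a + b; k <- a + c; N <- a + b + c + d
tab <- matrix(c(a, b, c, d), nrow = 2, byrow = TRUE)

# t statistic: observed proportion a/N against the proportion expected
# under independence, (k/N) * (m/N).
t_score <- (a / N - (k / N) * (m / N)) / sqrt(((a / N) * (1 - a / N)) / N)

# Fisher exact tests: one-sided for attraction, one-sided for repulsion,
# and two-sided.
p_attr <- fisher.test(tab, alternative = "greater")$p.value
p_rep  <- fisher.test(tab, alternative = "less")$p.value
p_two  <- fisher.test(tab, alternative = "two.sided")$p.value

# Hypothetical stand-in for the transformation: maps a p-value to the
# quantile of a chi-squared distribution with one degree of freedom.
as_chisq1 <- function(p) qchisq(1 - p, df = 1)
as_chisq1(c(0.05, 0.01, 0.001))
as_chisq1(p_rep)
```

The transformation puts p-values from different tests on a common scale: smaller p-values map to larger chi-squared quantiles, which makes them easier to compare with statistics such as chi2 or G.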
Properties of the class
An object of class assoc_scores has associated as.data.frame(), print(),
sort() and tibble::as_tibble() methods, an interactive explore() method and
useful getters, viz. n_types() and type_names().
An object of this class can be saved to file with write_assoc() and read
with read_assoc().
Examples
assoc_abcd(10 , 200, 100, 300, types = "four")
assoc_abcd(30, 1000, 14, 5000, types = "fictitious")
assoc_abcd(15, 5000, 16, 1000, types = "toy")
assoc_abcd( 1, 300, 4, 6000, types = "examples")
a <- c(10, 30, 15, 1)
b <- c(200, 1000, 5000, 300)
c <- c(100, 14, 16, 4)
d <- c(300, 5000, 10000, 6000)
types <- c("four", "fictitious", "toy", "examples")
(scores <- assoc_abcd(a, b, c, d, types = types))
as_data_frame(scores)
as_tibble(scores)
print(scores, sort_order = "PMI")
print(scores, sort_order = "alpha")
print(scores, sort_order = "none")
print(scores, sort_order = "nonsense")
print(scores, sort_order = "PMI",
      keep_cols = c("a", "exp_a", "PMI", "G_signed"))
print(scores, sort_order = "PMI",
      keep_cols = c("a", "b", "c", "d", "exp_a", "G_signed"))
print(scores, sort_order = "PMI",
      drop_cols = c("a", "b", "c", "d", "exp_a", "G_signed",
                    "RR_rows", "chi2_signed", "t"))