R: Identify and score multi-word expressions

textstat_collocations {quanteda.textstats}

R Documentation

Identify and score multi-word expressions

Description

Identify and score multi-word expressions, or adjacent fixed-length collocations, from text.

Usage

textstat_collocations(
  x,
  method = "lambda",
  size = 2,
  min_count = 2,
  smoothing = 0.5,
  tolower = TRUE,
  ...
)

Arguments

`x`	a character, corpus, or tokens object whose collocations will be scored. The tokens object should include punctuation, and if any words have been removed, these should have been removed with `padding = TRUE`. While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized.
`method`	association measure for detecting collocations. Currently this is limited to `"lambda"`. See Details.
`size`	integer; the length of the collocations to be scored
`min_count`	numeric; minimum frequency of collocations that will be scored
`smoothing`	numeric; a smoothing parameter added to the observed counts (default is 0.5)
`tolower`	logical; if `TRUE`, form collocations as lower-cased combinations
`...`	additional arguments passed to `tokens()`

Details

Documents are grouped for the purposes of scoring, but collocations will not span sentences. If x is a tokens object and some tokens have been removed, this should be done using `tokens_remove(x, pattern, padding = TRUE)` so that counts will still be accurate, while the pads prevent collocations that would span the removed tokens from being scored.
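The effect of padding can be seen on a single short text. This is a minimal sketch, assuming the quanteda package is installed; the sentence and the stopword list are made up for illustration.

```r
library("quanteda")

toks <- tokens("the minister of the interior")
stops <- c("the", "of")  # illustrative stopword list

# Without padding, "minister" and "interior" become adjacent,
# so a spurious bigram "minister interior" could be counted.
unpadded <- tokens_remove(toks, pattern = stops)

# With padding, empty strings hold the positions of the removed tokens,
# so the two words are still separated.
padded <- tokens_remove(toks, pattern = stops, padding = TRUE)

print(as.list(unpadded))
print(as.list(padded))
```

The padded object keeps the original token positions, which is what lets the scoring skip sequences that cross a removed token.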

The lambda computed for a size = K-word target multi-word expression is the coefficient for the K-way interaction parameter in the saturated log-linear model fitted to the counts of the terms forming the set of eligible multi-word expressions. This is the same as the "lambda" computed in Blaheta and Johnson (2001), except that all multi-word expressions are considered (rather than just verbs, as in that paper). The z is the Wald z-statistic, computed as the quotient of lambda and its standard error, as described below.

In detail:

Consider a K-word target expression x, and let z be any K-word expression. Define a comparison function c(x,z)=(j_{1}, \dots, j_{K})=c such that the kth element of c is 1 if the kth word in z is equal to the kth word in x, and 0 otherwise. Let c_{i}=(j_{i1}, \dots, j_{iK}), i=1, \dots, 2^{K}=M, be the possible values of c(x,z), with c_{M}=(1,1, \dots, 1). Consider the set of c(x,z_{r}) across all expressions z_{r} in a corpus of text, and let n_{i}, for i=1,\dots,M, denote the number of the c(x,z_{r}) which equal c_{i}, plus the smoothing constant given by the `smoothing` argument. The n_{i} are the counts in a 2^{K} contingency table whose dimensions are defined by the c_{i}.

\lambda: The K-way interaction parameter in the saturated loglinear model fitted to the n_{i}. It can be calculated as

\lambda = \sum_{i=1}^{M} (-1)^{K-b_{i}} \log n_{i}

where b_{i} is the number of the elements of c_{i} which are equal to 1.
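
As an illustration, for a two-word expression (K = 2, M = 4) the sum reduces to a log odds ratio. Here n_{11} is the smoothed count of expressions matching x in both positions, n_{10} and n_{01} are the counts of those matching only the first or only the second word, and n_{00} is the count of those matching neither. This two-digit subscript notation is introduced here for illustration only.

```latex
\lambda = \log n_{11} - \log n_{10} - \log n_{01} + \log n_{00}
        = \log \frac{n_{11} \, n_{00}}{n_{10} \, n_{01}}
```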

The Wald test z-statistic z is calculated as:

z = \frac{\lambda}{[\sum_{i=1}^{M} n_{i}^{-1}]^{(1/2)}}
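
The two formulas can be checked by hand on a small table. The following is a minimal base R sketch, not the package's internal implementation: the helper `lambda_z()` is hypothetical, and the counts are made up for illustration.

```r
# Hypothetical helper (not part of quanteda.textstats): lambda and z for a
# K-word expression, given the raw counts of each of the 2^K match patterns.
lambda_z <- function(counts, patterns, smoothing = 0.5) {
  n <- counts + smoothing   # smoothed cell counts n_i
  K <- ncol(patterns)       # length of the expression
  b <- rowSums(patterns)    # b_i: number of positions that match x
  lambda <- sum((-1)^(K - b) * log(n))
  z <- lambda / sqrt(sum(1 / n))
  c(lambda = lambda, z = z)
}

# All match patterns for K = 2, in the order (0,0), (1,0), (0,1), (1,1)
patterns <- as.matrix(expand.grid(w1 = 0:1, w2 = 0:1))

# Made-up raw counts, in the same row order as `patterns`
counts <- c(100, 5, 7, 20)

res <- lambda_z(counts, patterns)
print(res)
```

For K = 2 the lambda returned here is the log odds ratio of the smoothed 2 x 2 table, so a large positive value indicates that the two words co-occur more often than their separate frequencies would suggest.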

Value

textstat_collocations returns a data.frame of collocations and their scores and statistics. This consists of the collocations, their counts, length, and \lambda and z statistics. When size is a vector, then count_nested counts the lower-order collocations that occur within a higher-order collocation (but this does not affect the statistics).

Author(s)

Kenneth Benoit, Jouni Kuha, Haiyan Wang, and Kohei Watanabe

References

Blaheta, D. & Johnson, M. (2001). Unsupervised learning of multi-word verbs. Presented at the ACL/EACL Workshop on the Computational Extraction, Analysis and Exploitation of Collocations.

Examples

library("quanteda")
library("quanteda.textstats")  # provides textstat_collocations()
corp <- data_corpus_inaugural[1:2]
head(cols <- textstat_collocations(corp, size = 2, min_count = 2), 10)
head(cols <- textstat_collocations(corp, size = 3, min_count = 2), 10)

# extracting multi-part proper nouns (capitalized terms)
toks1 <- tokens(data_corpus_inaugural)
toks2 <- tokens_remove(toks1, pattern = stopwords("english"), padding = TRUE)
toks3 <- tokens_select(toks2, pattern = "^([A-Z][a-z\\-]{2,})", valuetype = "regex",
                       case_insensitive = FALSE, padding = TRUE)
tstat <- textstat_collocations(toks3, size = 3, tolower = FALSE)
head(tstat, 10)

# vectorized size
txt <- c(". . . . a b c . . a b c . . . c d e",
         "a b . . a b . . a b . . a b . a b",
         "b c d . . b c . b c . . . b c")
textstat_collocations(txt, size = 2:3)

# compounding tokens from collocations
toks <- tokens("This is the European Union.")
colls <- textstat_collocations(
    tokens("The new European Union is not the old European Union."),
    size = 2, min_count = 1, tolower = FALSE
)
colls
tokens_compound(toks, colls, case_insensitive = FALSE)

# from a collocations object
(coll <- textstat_collocations(tokens("a b c a b d e b d a b")))
phrase(coll)

[Package quanteda.textstats version 0.97 Index]