cooccurrence {udpipe}    R Documentation

Create a cooccurrence data.frame

Description

A cooccurrence data.frame indicates how many times each term co-occurs with another term.

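There are 3 types of cooccurrences:

  • words which occur in the same document, sentence or paragraph;

  • words which are directly followed by another word;

  • words which occur in the neighbourhood of another word, namely within skipgram words after it.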

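The function returns a cooccurrence data.frame with the fields term1, term2 and cooc, where cooc indicates how many times term1 and term2 co-occurred. This dataset can be constructed

  • from a data.frame, by checking within each group (a column of the data.frame) whether 2 terms occur together;

  • from a character vector of words, by counting how many times each word is followed by another word;

  • from a character vector of words, by counting how many times each word is followed by another word, either directly or after skipping a number of words in between.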

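Note that

  • for cooccurrence.data.frame no ordering is assumed, which implies that the function does not return self-occurrences if a word occurs several times in the same group of text, and that term1 is always smaller than term2 in the output;

  • for cooccurrence.character the text is assumed to be ordered from left to right, and the function also returns self-occurrences.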

If you calculate any of these 3 types of cooccurrences by a certain group, you can afterwards aggregate the results to obtain an overall cooccurrence data.frame.

Usage

cooccurrence(x, order = TRUE, ...)

## S3 method for class 'character'
cooccurrence(
  x,
  order = TRUE,
  ...,
  relevant = rep(TRUE, length(x)),
  skipgram = 0
)

## S3 method for class 'cooccurrence'
cooccurrence(x, order = TRUE, ...)

## S3 method for class 'data.frame'
cooccurrence(x, order = TRUE, ..., group, term)

Arguments

x

either

  • a data.frame with 1 row per document/term combination. In this case you need to provide group and term, where term is the column containing 1 term per row and group identifies something like a document id or a document + sentence id. This uses cooccurrence.data.frame.

  • a character vector of terms, where each element contains 1 term. This uses cooccurrence.character.

  • an object of class cooccurrence. This uses cooccurrence.cooccurrence.

order

logical indicating whether to sort the output from high cooccurrences to low cooccurrences. Defaults to TRUE.

...

other arguments passed on to the methods

relevant

a logical vector of the same length as x, indicating whether each word in x is relevant or not. This can be used to exclude stopwords from the cooccurrence calculation, or to select only nouns and adjectives and find cooccurrences among them (for example based on the Parts of Speech upos output from udpipe_annotate).
Only used when x is a character vector of words.

skipgram

integer of length 1, indicating how far in the neighbourhood to look for words.
skipgram is the maximum skip distance between words when calculating co-occurrences. Co-occurrences are of type skipgram-bigram, where a skipgram-bigram is a pair of words which occur at a distance of at most skipgram + 1 from each other.
Only used when x is a character vector of words.
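
The skipgram-bigram definition above can be illustrated with a small base R sketch. The helper skipgram_pairs is hypothetical and is not part of udpipe; it only counts ordered pairs of words at a distance of at most skipgram + 1, ignores the relevant argument, and makes no claim about how udpipe computes its result internally.

```r
# Hypothetical helper (not a udpipe function): counts ordered pairs
# (x[i], x[j]) with 1 <= j - i <= skipgram + 1.
skipgram_pairs <- function(x, skipgram = 0) {
  stopifnot(is.character(x), length(x) >= 2, skipgram >= 0)
  n <- length(x)
  pairs <- list()
  for (d in seq_len(skipgram + 1)) {
    if (d < n) {
      # pair every word with the word d positions further to the right
      pairs[[d]] <- data.frame(term1 = x[seq_len(n - d)],
                               term2 = x[seq(d + 1, n)],
                               stringsAsFactors = FALSE)
    }
  }
  pairs <- do.call(rbind, pairs)
  pairs$cooc <- 1L
  # sum the occurrences of each ordered (term1, term2) pair
  out <- aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
  out[order(out$cooc, decreasing = TRUE), ]
}

x <- c("A", "B", "A", "A", "B", "c")
direct  <- skipgram_pairs(x, skipgram = 0)  # adjacent words only
skipped <- skipgram_pairs(x, skipgram = 1)  # adjacent words, or 1 word skipped
```

With skipgram = 0 only directly adjacent words are paired; with skipgram = 1 the pairs at distance 2 are added on top of those.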

group

character vector of column names in the data frame x; cooccurrences are calculated within the groups defined by these columns.
This is typically a field like a document id or a sentence identifier. To be used if x is a data.frame.

term

character string of a column in the data frame x, containing 1 term per row. To be used if x is a data.frame.

Value

a data.frame with columns term1, term2 and cooc, where cooc indicates how many times the combination of term1 and term2 occurred.
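
As a minimal sketch of the returned structure, the call below runs the data.frame method on a toy data.frame. The object toy and its column names doc_id and term are made up for illustration; only the presence of the columns term1, term2 and cooc is taken from the description above.

```r
library(udpipe)

# Toy input (hypothetical data): 1 row per document/term combination
toy <- data.frame(
  doc_id = c("d1", "d1", "d1", "d2", "d2"),
  term   = c("cat", "dog", "fish", "cat", "dog"),
  stringsAsFactors = FALSE
)

# Cooccurrences of terms within the same doc_id
cooc <- cooccurrence(toy, group = "doc_id", term = "term")
str(cooc)
```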

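Methods (by class)

  • character: create a cooccurrence data.frame based on a vector of terms.

  • cooccurrence: aggregate co-occurrence statistics by summing cooc by term1/term2.

  • data.frame: create a cooccurrence data.frame based on a data.frame, looking within a document / sentence / paragraph / group whether terms co-occur.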

Examples


data(brussels_reviews_anno)

## By document, which lemmas co-occur
x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
x <- cooccurrence(x, group = "doc_id", term = "lemma")
head(x)

## Which words follow each other
x <- c("A", "B", "A", "A", "B", "c")
cooccurrence(x)

## Which Spanish lemmas follow each other
x <- subset(brussels_reviews_anno, language == "es")
x <- cooccurrence(x$lemma)
head(x)

## Same, keeping only nouns and adjectives and allowing up to 4 skipped words
x <- subset(brussels_reviews_anno, language == "es")
x <- cooccurrence(x$lemma, relevant = x$xpos %in% c("NN", "JJ"), skipgram = 4)
head(x)

## Which nouns follow each other in the same document
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- subset(x, language == "nl" & xpos %in% c("NN"))
x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
head(x)

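## Aggregate the per-document cooccurrences into overall counts
x_nodoc <- cooccurrence(x)
## Drop every pair which involves the lemma "appartement"
x_nodoc <- subset(x_nodoc, term1 != "appartement" & term2 != "appartement")
head(x_nodoc)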
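
The aggregation step can also be pictured in base R. The sketch below sums made-up per-document counts over term1 and term2; the object per_doc and its values are hypothetical, and the code illustrates the idea of aggregating grouped cooccurrences rather than udpipe's own implementation.

```r
# Hypothetical per-document cooccurrence counts (values made up for illustration)
per_doc <- data.frame(
  doc_id = c("d1", "d1", "d2", "d2"),
  term1  = c("A", "B", "A", "A"),
  term2  = c("B", "C", "B", "C"),
  cooc   = c(2L, 1L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Sum cooc over all documents for each (term1, term2) pair
overall <- aggregate(cooc ~ term1 + term2, data = per_doc, FUN = sum)
# Sort from high to low cooccurrence
overall <- overall[order(overall$cooc, decreasing = TRUE), ]
overall
```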

[Package udpipe version 0.8.11 Index]