cooccurrence {udpipe} | R Documentation |
Create a cooccurrence data.frame
Description
A cooccurrence data.frame indicates how many times each term co-occurs with another term.
There are 3 types of cooccurrences:
Looking at which words are located in the same document/sentence/paragraph.
Looking at which words are followed by another word
Looking at which words are in the neighbourhood of the word, as in: follows the word within skipgram number of words
The output of the function gives a cooccurrence data.frame which contains the fields term1, term2 and cooc, where cooc indicates how many times term1 and term2 co-occurred. This dataset can be constructed:
based upon a data frame, where you look within a group (column of the data.frame) whether 2 terms occurred in that group.
based upon a vector of words, in which case we look at how many times each word is followed by another word.
based upon a vector of words, in which case we look at how many times each word is followed by another word, either directly or after skipping a number of words in between.
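The two vector-based constructions above can be illustrated with a small base-R sketch. This is not the udpipe implementation: `follow_counts()` is a hypothetical helper, and it assumes that a skipgram of k means that up to k words may be skipped between term1 and term2.

```r
# Illustrative sketch only; follow_counts() is a hypothetical helper,
# not a function from udpipe.
follow_counts <- function(words, skipgram = 0) {
  n <- length(words)
  pairs <- list()
  # offset 1 pairs each word with the word directly after it;
  # larger offsets skip (offset - 1) words in between (assumed semantics)
  for (offset in seq_len(skipgram + 1)) {
    if (n > offset) {
      pairs[[offset]] <- data.frame(term1 = words[seq_len(n - offset)],
                                    term2 = words[seq(offset + 1, n)],
                                    cooc = 1L, stringsAsFactors = FALSE)
    }
  }
  if (length(pairs) == 0) {
    return(data.frame(term1 = character(), term2 = character(), cooc = integer()))
  }
  # sum the occurrences of each ordered (term1, term2) combination
  out <- aggregate(cooc ~ term1 + term2, data = do.call(rbind, pairs), FUN = sum)
  out[order(-out$cooc), ]
}

words <- c("A", "B", "A", "A", "B", "c")
counts <- follow_counts(words)                     # directly following words only
counts_skip <- follow_counts(words, skipgram = 1)  # also allow 1 word in between
```

In this sketch the pair (A, A) is kept, which mirrors the idea that ordered text can produce self-occurrences.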
Note that
For cooccurrence.data.frame no ordering is assumed. This implies that the function does not return self-occurrences if a word occurs several times in the same group of text, and that term1 is always smaller than term2 in the output.
For cooccurrence.character we assume text is ordered from left to right, and the function also returns self-occurrences.
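The unordered, group-based case can be sketched in base R as follows. This is an illustration of the idea, not the udpipe implementation: `group_counts()` is a hypothetical helper, and it assumes that a pair of terms is counted at most once per group.

```r
# Illustrative sketch only; group_counts() is a hypothetical helper,
# not a function from udpipe.
group_counts <- function(data, group, term) {
  per_group <- lapply(split(data[[term]], data[[group]]), function(terms) {
    terms <- sort(unique(terms))           # repeated terms collapse: no self-occurrences
    if (length(terms) < 2) return(NULL)    # a single term cannot form a pair
    pairs <- t(combn(terms, 2))            # each unordered pair once, term1 sorted before term2
    data.frame(term1 = pairs[, 1], term2 = pairs[, 2], cooc = 1L,
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, per_group)
  if (is.null(pairs)) {
    return(data.frame(term1 = character(), term2 = character(), cooc = integer()))
  }
  # count in how many groups each pair was seen (assumed counting rule)
  aggregate(cooc ~ term1 + term2, data = pairs, FUN = sum)
}

docs <- data.frame(doc_id = c(1, 1, 1, 2, 2),
                   lemma = c("b", "a", "a", "a", "b"),
                   stringsAsFactors = FALSE)
res <- group_counts(docs, group = "doc_id", term = "lemma")
```

Because the terms are sorted within each group before pairing, the pair (b, a) never appears; only (a, b) does.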
You can also aggregate cooccurrences if you decide to do any of these 3 by a certain group and next want to obtain an overall aggregate.
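As a rough illustration of such an overall aggregate, the base-R sketch below sums cooc over term1 and term2 for cooccurrences that were computed per group. The `per_doc` data is made up for the example, and base `aggregate()` is used here as a stand-in rather than the package's own method.

```r
# Made-up per-document cooccurrences, for illustration only
per_doc <- data.frame(
  doc_id = c("d1", "d1", "d2"),
  term1  = c("A", "B", "A"),
  term2  = c("B", "A", "B"),
  cooc   = c(2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# Drop the grouping column by summing cooc for each (term1, term2) combination
overall <- aggregate(cooc ~ term1 + term2, data = per_doc, FUN = sum)
# Sort from high to low cooccurrence
overall <- overall[order(-overall$cooc), ]
```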
Usage
cooccurrence(x, order = TRUE, ...)
## S3 method for class 'character'
cooccurrence(
x,
order = TRUE,
...,
relevant = rep(TRUE, length(x)),
skipgram = 0
)
## S3 method for class 'cooccurrence'
cooccurrence(x, order = TRUE, ...)
## S3 method for class 'data.frame'
cooccurrence(x, order = TRUE, ..., group, term)
Arguments
x |
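either a data.frame (used together with group and term), a character vector of terms, or an object of class cooccurrence |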
order |
logical indicating if we need to sort the output from high cooccurrences to low cooccurrences. Defaults to TRUE. |
... |
other arguments passed on to the methods |
relevant |
a logical vector of the same length as x, indicating which elements of x are relevant |
skipgram |
integer of length 1, indicating how far in the neighbourhood to look for words. |
group |
character vector of columns in the data frame |
term |
character string of a column in the data frame |
Value
a data.frame with columns term1, term2 and cooc indicating for the combination of term1 and term2 how many times this combination occurred
Methods (by class)
- character: Create a cooccurrence data.frame based on a vector of terms
- cooccurrence: Aggregate co-occurrence statistics by summing the cooc by term1/term2
- data.frame: Create a cooccurrence data.frame based on a data.frame where you look within a document / sentence / paragraph / group if terms co-occur
Examples
data(brussels_reviews_anno)
## By document, which lemmas co-occur
x <- subset(brussels_reviews_anno, xpos %in% c("NN", "JJ") & language %in% "fr")
x <- cooccurrence(x, group = "doc_id", term = "lemma")
head(x)
## Which words follow each other
x <- c("A", "B", "A", "A", "B", "c")
cooccurrence(x)
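data(brussels_reviews_anno)
## Which lemmas follow each other in the Spanish reviews
x <- subset(brussels_reviews_anno, language == "es")
x <- cooccurrence(x$lemma)
head(x)
## Same, but only for nouns and adjectives, looking within a skipgram of 4 words
x <- subset(brussels_reviews_anno, language == "es")
x <- cooccurrence(x$lemma, relevant = x$xpos %in% c("NN", "JJ"), skipgram = 4)
head(x)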
## Which nouns follow each other in the same document
library(data.table)
x <- as.data.table(brussels_reviews_anno)
x <- subset(x, language == "nl" & xpos %in% c("NN"))
x <- x[, cooccurrence(lemma, order = FALSE), by = list(doc_id)]
head(x)
## Aggregate the per-document cooccurrences into an overall count
x_nodoc <- cooccurrence(x)
x_nodoc <- subset(x_nodoc, term1 != "appartement" & term2 != "appartement")
head(x_nodoc)