R: Get the set of Biterms from a tokenised data frame

terms.data.frame {BTM}

R Documentation

Get the set of Biterms from a tokenised data frame

Description

Extracts words that occur near one another, within a given window, from a tokenised data frame. With the default window, the biterms returned are those used when fitting BTM with its default window parameter.
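
To make the idea of window-based extraction concrete, the following base R sketch enumerates unordered word pairs within a sliding window, separately for each context. It is an illustration under stated assumptions, not the package's implementation: `biterms_in_window()` is a hypothetical helper, the boundary convention (pairing each word with followers fewer than `window` positions away) is assumed, and pairs are ordered alphabetically here, which may differ from the ordering BTM uses.

```r
# Illustrative sketch only; biterms_in_window() is a hypothetical helper,
# not a function exported by BTM.
biterms_in_window <- function(words, window = 15) {
  n <- length(words)
  pairs <- list()
  for (i in seq_len(max(n - 1, 0))) {
    # Assumed convention: pair word i with followers fewer than
    # `window` positions away
    last <- min(i + window - 1, n)
    if (last > i) {
      for (j in (i + 1):last) {
        # A biterm is unordered, so store each pair in sorted
        # (alphabetical) order
        pair <- sort(c(words[i], words[j]))
        pairs[[length(pairs) + 1]] <- data.frame(
          term1 = pair[1], term2 = pair[2], stringsAsFactors = FALSE
        )
      }
    }
  }
  if (length(pairs) == 0) {
    return(data.frame(term1 = character(0), term2 = character(0),
                      stringsAsFactors = FALSE))
  }
  do.call(rbind, pairs)
}

# Hypothetical toy input: two contexts, extracted separately and combined
x <- data.frame(
  doc_id = c("d1", "d1", "d1", "d1", "d2", "d2"),
  token  = c("dog", "cat", "bird", "ant", "cat", "dog"),
  stringsAsFactors = FALSE
)
per_context <- lapply(split(x$token, x$doc_id), biterms_in_window, window = 2)
toy_biterms <- do.call(rbind, per_context)
print(toy_biterms)
```

Pairs are never formed across context identifiers in this sketch, because each context is processed on its own before the results are bound together.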

Usage

## S3 method for class 'data.frame'
terms(x, type = c("tokens", "biterms"), window = 15, ...)

Arguments

`x`	a tokenised data frame containing one row per token, with 2 columns: the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text); the second column is of type character and contains the sequence of words occurring within the context identifier
`type`	a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'.
`window`	integer with the window size for biterm extraction. Defaults to 15.
`...`	not used
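
As a rough illustration of the input shape expected for `x`, the sketch below builds a two-column, one-row-per-token data frame from raw text using base R only. The texts, the column names `doc_id` and `token`, and the whitespace tokenisation are all assumptions made for the example; a real pipeline would more likely use a tokeniser or annotator such as udpipe.

```r
# Hypothetical raw input: one text per context identifier
texts <- c(doc1 = "the cat sat on the mat", doc2 = "dogs chase cats")

# Naive whitespace tokenisation (assumption: good enough for a toy example)
tokens <- strsplit(tolower(texts), "\\s+")

# One row per token: first column is the context identifier,
# second column is the token as character
x <- data.frame(
  doc_id = rep(names(tokens), times = lengths(tokens)),
  token  = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)
str(x)
```

The order of rows within each context matters, since the window is defined over the sequence of words.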

Value

Depending on whether type is set to 'tokens' or 'biterms', the following is returned:

If type='tokens': a list containing 2 elements:
- n which indicates the number of tokens
- tokens which is a data.frame with columns id, token and freq, indicating for all tokens found in the data the frequency of occurrence
If type='biterms': a list containing 2 elements:
- n which indicates the number of biterms used to train the model
- biterms which is a data.frame with columns term1 and term2, indicating all biterms found in the data. The same biterm combination can occur several times.
Note that a biterm is unordered: in the output of type='biterms', term1 is always smaller than or equal to term2.
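
Because the same biterm combination can occur several times, it is often useful to count occurrences. The sketch below does this in base R on a mock object that only mimics the documented shape of the type='biterms' result; the values in `res` are invented for illustration and are not output of the package.

```r
# Mock object with the documented shape of the type = "biterms" result
# (values are made up for illustration)
res <- list(
  n = 4L,
  biterms = data.frame(
    term1 = c("cat", "cat", "bird", "cat"),
    term2 = c("dog", "dog", "cat", "mat"),
    stringsAsFactors = FALSE
  )
)

# Count how often each (term1, term2) combination occurs
counts <- aggregate(freq ~ term1 + term2,
                    data = cbind(res$biterms, freq = 1L),
                    FUN = sum)

# Most frequent biterms first
counts <- counts[order(-counts$freq), ]
print(counts)
```

Since term1 is always smaller than or equal to term2, no extra normalisation of the pair order is needed before counting.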

Note

If x is a data.frame which has an attribute called 'terms', that 'terms' attribute is returned as is.

Examples


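library(BTM)
library(udpipe)
## Annotated reviews shipped with the udpipe package
data("brussels_reviews_anno", package = "udpipe")
## Keep the Dutch reviews and restrict to nouns
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
## Two columns: context identifier and token (here the lemma)
x <- x[, c("doc_id", "lemma")]
## Biterms within a window of 15
biterms <- terms(x, window = 15, type = "biterms")
str(biterms)
## Token frequencies
tokens <- terms(x, type = "tokens")
str(tokens)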

[Package BTM version 0.3.7 Index]