terms.data.frame {BTM} | R Documentation |
Get the set of Biterms from a tokenised data frame
Description
Extracts pairs of words (biterms) that occur in the neighbourhood of one another, within a given window range.
With the default window setting, the result contains the biterms used when fitting BTM with its default window parameter.
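To make the window idea concrete, here is a minimal base R sketch that enumerates co-occurring token pairs for a single document. The helper `window_pairs` is hypothetical and not part of BTM. It assumes that two tokens co-occur when their positions differ by less than `window`; the package's internal convention may differ.

```r
# Hypothetical helper, not part of BTM: enumerate co-occurring token pairs
# in one tokenised document.
# Assumption: tokens at positions i < j form a pair when j - i < window.
window_pairs <- function(tokens, window = 15) {
  empty <- data.frame(term1 = character(0), term2 = character(0),
                      stringsAsFactors = FALSE)
  n <- length(tokens)
  if (n < 2) return(empty)
  pairs <- lapply(seq_len(n - 1), function(i) {
    last <- min(i + window - 1, n)   # last position still inside the window
    if (last <= i) return(NULL)      # window too small to reach a neighbour
    j <- (i + 1):last
    data.frame(term1 = tokens[i], term2 = tokens[j], stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, pairs)
  if (is.null(out)) empty else out
}

doc <- c("hotel", "room", "view", "breakfast")
pairs_wide   <- window_pairs(doc, window = 15)  # every pair in this short document
pairs_narrow <- window_pairs(doc, window = 2)   # adjacent tokens only
```

A wide window pairs every token with every later token of a short document, while a narrow one keeps only close neighbours.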
Usage
## S3 method for class 'data.frame'
terms(x, type = c("tokens", "biterms"), window = 15, ...)
Arguments
x | a tokenised data frame containing one row per token with 2 columns, as in the example below: a context identifier (doc_id) and the token (lemma) |
type | a character string, either 'tokens' or 'biterms'. Defaults to 'tokens'. |
window | integer with the window size for biterm extraction. Defaults to 15. |
... | not used |
Value
Depending on whether type is set to 'tokens' or 'biterms', the following is returned:
If type='tokens': a list containing 2 elements:
- n, which indicates the number of tokens
- tokens, which is a data.frame with columns id, token and freq, giving the frequency of occurrence of every token found in the data
If type='biterms': a list containing 2 elements:
- n, which indicates the number of biterms used to train the model
- biterms, which is a data.frame with columns term1 and term2, listing all biterms found in the data. The same biterm combination can occur several times.
Note that a biterm is unordered: in the output of type='biterms', term1 is always smaller than or equal to term2.
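The unordered convention can be illustrated with a small base R sketch. The toy pairs are made up, and ordering by plain string comparison is an assumption made for illustration; the package may order terms on a different key.

```r
# Toy co-occurrence pairs in arbitrary order (illustrative data only).
pairs <- data.frame(term1 = c("room", "hotel", "view", "room"),
                    term2 = c("hotel", "room", "hotel", "hotel"),
                    stringsAsFactors = FALSE)

# Assumption: "smaller" is decided by string comparison.
# pmin()/pmax() work element-wise on character vectors.
normalised <- data.frame(term1 = pmin(pairs$term1, pairs$term2),
                         term2 = pmax(pairs$term1, pairs$term2),
                         stringsAsFactors = FALSE)

# Repeated combinations are kept as separate rows; counting them is optional.
counts <- aggregate(n ~ term1 + term2,
                    data = cbind(normalised, n = 1L), FUN = sum)
```

After normalisation, ("room", "hotel") and ("hotel", "room") become the same row, which is why the same combination can appear several times.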
Note
If x is a data.frame which has an attribute called 'terms', that attribute is returned as is.
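The behaviour described in this note amounts to a short-circuit on a precomputed attribute. The base R sketch below mimics it with a hypothetical stand-in, `terms_or_compute`; it is not the package's code, and the `compute` argument is an assumption introduced for illustration.

```r
# Toy tokenised data frame with a precomputed 'terms' attribute attached.
x <- data.frame(doc_id = c("d1", "d1"),
                lemma  = c("hotel", "room"),
                stringsAsFactors = FALSE)
precomputed <- list(n = 1L,
                    biterms = data.frame(term1 = "hotel", term2 = "room",
                                         stringsAsFactors = FALSE))
attr(x, "terms") <- precomputed

# Hypothetical stand-in: return the attribute when present,
# otherwise fall back to computing the result.
terms_or_compute <- function(x, compute) {
  cached <- attr(x, "terms", exact = TRUE)
  if (!is.null(cached)) return(cached)
  compute(x)
}

# The fallback is never called here because the attribute exists.
out <- terms_or_compute(x, compute = function(x) stop("not reached"))
```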
Examples
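library(udpipe)
library(BTM)  # provides the terms() method for data frames
# Annotated reviews shipped with udpipe; keep the Dutch ones
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
# Keep nouns only (xpos tags NN, NNP, NNS)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
# Two columns: context identifier and token
x <- x[, c("doc_id", "lemma")]
# Biterms within a window of 15
biterms <- terms(x, window = 15, type = "biterms")
str(biterms)
# Token frequencies
tokens <- terms(x, type = "tokens")
str(tokens)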