BTM {BTM}    R Documentation
Construct a Biterm Topic Model on Short Text
Description
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window.
BTM models the biterm occurrences in a corpus (unlike LDA models, which model the word occurrences in a document).
It is a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (wi, wj) is defined as

P(b) = \sum_z P(wi|z) * P(wj|z) * P(z)

where the sum runs over the k topics you want to extract. Estimation of the topic model is done with the Gibbs sampling algorithm, where estimates are provided for P(w|z) = phi and P(z) = theta.
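As an illustrative sketch (not part of the package interface), once a model has been fitted the probability of a biterm (w1, w2) can be recomputed from these estimates, since phi holds P(w|z) for each token and theta holds P(z):

## Illustrative sketch: recover P(b) for a biterm (w1, w2) from a fitted model,
## assuming 'model' is a fitted BTM object and w1, w2 occur in rownames(model$phi)
w1 <- rownames(model$phi)[1]
w2 <- rownames(model$phi)[2]
sum(model$phi[w1, ] * model$phi[w2, ] * model$theta)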
Usage
BTM(
  data,
  k = 5,
  alpha = 50/k,
  beta = 0.01,
  iter = 1000,
  window = 15,
  background = FALSE,
  trace = FALSE,
  biterms,
  detailed = FALSE
)
Arguments
data
a tokenised data frame containing one row per token, with 2 columns: the first column is a context identifier (e.g. a document identifier) and the second column contains the token itself (see the sketch after this list for an example layout)
k
integer with the number of topics to identify
alpha
numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k.
beta
numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.
iter
integer with the number of iterations of Gibbs sampling
window
integer with the window size for biterm extraction. Defaults to 15.
background
logical indicating whether the first topic should be set to a background topic that equals the empirical word distribution. Defaults to FALSE.
trace
logical indicating whether to print out the evolution of the Gibbs sampling iterations. Defaults to FALSE.
biterms
optionally, your own set of biterms to use for modelling.
detailed
logical indicating whether to return detailed output, which also contains the vocabulary and the biterms used to construct the model. Defaults to FALSE.
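As a purely hypothetical sketch of the expected layout of the data argument (the values below are made up; the column names doc_id and lemma simply mirror those used in the Examples), the input contains one row per token with the context identifier first:

## Hypothetical sketch of the input layout for the 'data' argument:
## one row per token, first column a context/document identifier,
## second column the token itself
x <- data.frame(doc_id = c("doc1", "doc1", "doc1", "doc2", "doc2"),
                lemma  = c("appartement", "locatie", "gastheer", "locatie", "verblijf"),
                stringsAsFactors = FALSE)
x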
Value
an object of class BTM which is a list containing
model: a pointer to the C++ BTM model
K: the number of topics
W: the number of tokens in the data
alpha: the symmetric dirichlet prior probability of a topic P(z)
beta: the symmetric dirichlet prior probability of a word given the topic P(w|z)
iter: the number of iterations of Gibbs sampling
background: indicator if the first topic is set to the background topic that equals the empirical word distribution.
theta: a vector with the topic probability P(z), which is determined by the overall proportion of biterms in each topic
phi: a matrix of dimension W x K with one row for each token in the data. This matrix contains the probability of the token given the topic P(w|z). The rownames of the matrix indicate the token w.
vocab: a data.frame with columns token and freq indicating the frequency of occurrence of the tokens in data. Only provided in case argument detailed is set to TRUE.
biterms: the result of a call to terms with type set to biterms, containing all the biterms used in the model. Only provided in case argument detailed is set to TRUE.
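As a short illustrative sketch (assuming model is a fitted BTM object as in the Examples), these components can be inspected as follows:

## Illustrative sketch: inspecting the components of a fitted BTM object
model$K                    # number of topics
model$theta                # topic probabilities P(z)
dim(model$phi)             # W x K matrix with P(w|z)
head(rownames(model$phi))  # the tokens w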
Note
A biterm is defined as a pair of words co-occurring in the same text window.
If you have, as an example, a document with the sequence of words 'A B C B', and assuming the window size is set to 3, there are two text windows which can generate biterms, namely
the text window 'A B C' with biterms 'A B', 'B C', 'A C'
and the text window 'B C B' with biterms 'B C', 'C B', 'B B'.
A biterm is an unordered word pair, so 'B C' = 'C B'. Thus, the document 'A B C B'
will have the following biterm frequencies:
'A B': 1
'B C': 3
'A C': 1
'B B': 1
These biterms are used to create the model.
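The following minimal sketch (plain R, not the package's internal implementation) reproduces these biterm counts for the document 'A B C B' with a window size of 3:

## Sketch: enumerate the biterms per text window and count them
doc    <- c("A", "B", "C", "B")
window <- 3
biterms <- character()
for (start in seq_len(length(doc) - window + 1)) {
  w       <- doc[start:(start + window - 1)]
  pairs   <- combn(w, 2)  # all word pairs within this window
  biterms <- c(biterms, apply(pairs, 2, function(p) paste(sort(p), collapse = " ")))
}
table(biterms)
## A B  A C  B B  B C
##   1    1    1    3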
References
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Texts. WWW 2013. https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf
See Also
predict.BTM, terms.BTM, logLik.BTM
Examples
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)
## Another small run with first topic the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)
##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2),
                   by = list(doc_id)]
head(biterms)
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE,
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)
bitermset$n
sum(biterms$cooc)
## Not run:
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)
library(LDAvis)
docsize <- table(x$doc_id)
scores <- predict(model, x)
scores <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi),
  theta = scores,
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token,
  term.frequency = model$vocabulary$freq)
serVis(json)
## End(Not run)