BTM {BTM}  R Documentation 
Construct a Biterm Topic Model on Short Text
Description
The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrence patterns (e.g., biterms).
A biterm consists of two words co-occurring in the same context, for example, in the same short text window.
BTM models the biterm occurrences in a corpus (unlike LDA models which model the word occurrences in a document).
It's a generative model. In the generation procedure, a biterm is generated by drawing two words independently from the same topic z. In other words, the distribution of a biterm b = (w_i, w_j) is defined as:
P(b) = \sum_{z=1}^{k} P(w_i|z) * P(w_j|z) * P(z)
where k is the number of topics you want to extract. Estimation of the topic model is done with the Gibbs sampling algorithm, which provides estimates for P(w|z) = phi and P(z) = theta.
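As a rough illustration of the formula above, the following base R sketch computes P(b) for a single biterm from made-up values of theta and phi. The numbers, the tokens and the helper name biterm_prob are assumptions for illustration only; they are not output of the package.

```r
# Toy parameters for K = 2 topics and a 3-word vocabulary (made-up numbers)
theta <- c(0.6, 0.4)                        # P(z), sums to 1
phi <- matrix(c(0.5, 0.1,
                0.3, 0.2,
                0.2, 0.7),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("beer", "food", "room"), NULL))  # P(w|z), W x K

# P(b) = sum over topics z of P(w_i|z) * P(w_j|z) * P(z)
# biterm_prob is a hypothetical helper, not a function of the BTM package
biterm_prob <- function(wi, wj, phi, theta) {
  sum(phi[wi, ] * phi[wj, ] * theta)
}

p <- biterm_prob("beer", "food", phi, theta)
p  # 0.5 * 0.3 * 0.6 + 0.1 * 0.2 * 0.4 = 0.098
```

Because the two words are drawn independently given the topic, the order of the words does not change the result: biterm_prob("food", "beer", phi, theta) gives the same value.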
Usage
BTM(
data,
k = 5,
alpha = 50/k,
beta = 0.01,
iter = 1000,
window = 15,
background = FALSE,
trace = FALSE,
biterms,
detailed = FALSE
)
Arguments
data 
a tokenised data frame containing one row per token with 2 columns: a context identifier (for example a document id) and the token occurring within that context

k 
integer with the number of topics to identify 
alpha 
numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k. 
beta 
numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01. 
iter 
integer with the number of iterations of Gibbs sampling 
window 
integer with the window size for biterm extraction. Defaults to 15. 
background 
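logical; if set to TRUE, the first topic is set to a background topic that equals the empirical word distribution. Defaults to FALSE. 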
trace 
logical indicating to print out evolution of the Gibbs sampling iterations. Defaults to FALSE. 
biterms 
optionally, your own set of biterms to use for modelling. 
detailed 
logical indicating to return detailed output containing as well the vocabulary and the biterms used to construct the model. Defaults to FALSE. 
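A minimal sketch of how the data argument could be prepared, using base R only. The texts, the whitespace tokenisation and the column names are assumptions for illustration; the column order (context identifier first, token second) follows the Examples section, which passes x[, c("doc_id", "lemma")].

```r
# Hypothetical short texts keyed by a document identifier
texts <- c(d1 = "cheap hotel near station",
           d2 = "great beer and great food")

# Naive whitespace tokenisation; real use would more likely rely on a
# proper tokeniser or lemmatiser such as the udpipe annotation in the Examples
tokens <- strsplit(tolower(texts), "\\s+")

# One row per token: first column the context id, second column the token
x <- data.frame(doc_id = rep(names(tokens), lengths(tokens)),
                token  = unlist(tokens, use.names = FALSE),
                stringsAsFactors = FALSE)

# The default prior for P(z) shown under Usage depends on the number of topics
k <- 5
alpha <- 50 / k
```

A data frame shaped like x could then be passed as data, for example BTM(x, k = k, alpha = alpha), although a corpus this small is only useful for checking the input format.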
Value
an object of class BTM which is a list containing
model: a pointer to the C++ BTM model
K: the number of topics
W: the number of tokens in the data
alpha: the symmetric Dirichlet prior probability of a topic P(z)
beta: the symmetric Dirichlet prior probability of a word given the topic P(w|z)
iter: the number of iterations of Gibbs sampling
background: indicator if the first topic is set to the background topic that equals the empirical word distribution.
theta: a vector with the topic probability P(z) which is determined by the overall proportions of biterms in it
phi: a matrix of dimension W x K with one row for each token in the data. This matrix contains the probability of the token given the topic P(w|z). The rownames of the matrix indicate the token w
vocab: a data.frame with columns token and freq indicating the frequency of occurrence of the tokens in data. Only provided in case argument detailed is set to TRUE
biterms: the result of a call to terms with type set to biterms, containing all the biterms used in the model. Only provided in case argument detailed is set to TRUE
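To show how the phi element can be read, here is a small base R sketch on a made-up W x K matrix standing in for model$phi. The values, the tokens and the column names are assumptions for illustration; whether the real matrix carries column names is not stated here.

```r
# Made-up phi (W x K) with tokens as rownames, standing in for model$phi
phi <- matrix(c(0.5, 0.1,
                0.3, 0.2,
                0.2, 0.7),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("beer", "food", "room"),
                              c("topic1", "topic2")))

# Each column of this toy matrix is a distribution over tokens, P(w|z)
colSums(phi)

# Top 2 tokens per topic, ordered by decreasing P(w|z)
top_tokens <- apply(phi, 2, function(p) names(sort(p, decreasing = TRUE))[1:2])
top_tokens
```

On a fitted model, terms(model) as used in the Examples section is the more direct way to list the most likely tokens per topic.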
Note
A biterm is defined as a pair of words co-occurring in the same text window.
If you have as an example a document with sequence of words 'A B C B', and assuming the window size is set to 3, that implies there are two text windows which can generate biterms, namely
text window 'A B C' with biterms 'A B', 'B C', 'A C'
and text window 'B C B' with biterms 'B C', 'C B', 'B B'.
A biterm is an unordered word pair where 'B C' = 'C B'. Thus, the document 'A B C B' will have the following biterm frequencies:
'A B': 1
'B C': 3
'A C': 1
'B B': 1
These biterms are used to create the model.
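The counting described in this note can be reproduced with a short base R sketch. The helper count_biterms is hypothetical and only mirrors the sliding-window description given here; it is not the package's internal biterm extraction, which may differ in detail.

```r
# Count unordered word pairs over all sliding windows of a given size.
# count_biterms is a hypothetical helper written for this illustration.
count_biterms <- function(words, window = 3) {
  n <- length(words)
  starts <- seq_len(max(n - window + 1, 1))
  pairs <- character(0)
  for (s in starts) {
    w <- words[s:min(s + window - 1, n)]
    if (length(w) < 2) next
    idx <- utils::combn(length(w), 2)   # all position pairs within the window
    # sort each pair so that 'C B' and 'B C' count as the same biterm
    p <- apply(idx, 2, function(i) paste(sort(w[i]), collapse = " "))
    pairs <- c(pairs, p)
  }
  table(pairs)
}

counts <- count_biterms(c("A", "B", "C", "B"), window = 3)
counts
```

For the document 'A B C B' this gives 1 for 'A B', 1 for 'A C', 1 for 'B B' and 3 for 'B C', matching the frequencies listed in the note.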
References
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013, https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTMWWW13.pdf
See Also
predict.BTM, terms.BTM, logLik.BTM
Examples
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)
scores <- predict(model, newdata = x)
## Another small run with first topic the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)
##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma,
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2),
                   by = list(doc_id)]
head(biterms)
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE,
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)
bitermset$n
sum(biterms$cooc)
## Not run:
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)
library(LDAvis)
docsize <- table(x$doc_id)
scores <- predict(model, x)
scores <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi),
  theta = scores,
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token,
  term.frequency = model$vocabulary$freq)
serVis(json)
## End(Not run)