BTM {BTM}    R Documentation

Construct a Biterm Topic Model on Short Text

Description

The Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modelling word-word co-occurrence patterns (i.e., biterms).

Usage

BTM(
  data,
  k = 5,
  alpha = 50/k,
  beta = 0.01,
  iter = 1000,
  window = 15,
  background = FALSE,
  trace = FALSE,
  biterms,
  detailed = FALSE
)

Arguments

data

a tokenised data frame containing one row per token, with 2 columns:

  • the first column is a context identifier (e.g. a tweet id, a document id, a sentence id, an identifier of a survey answer, an identifier of a part of a text)

  • the second column is a column of type character containing the sequence of words occurring within that context identifier (see the sketch below)
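
As an illustration of the expected structure, a minimal hand-made input could look as follows (the column names doc_id and token are purely illustrative):

## hypothetical toy input: one row per token, ordered within each document
tokens <- data.frame(doc_id = c("d1", "d1", "d1", "d2", "d2", "d2"),
                     token  = c("topic", "model", "text", "short", "text", "topic"),
                     stringsAsFactors = FALSE)
## model <- BTM(tokens, k = 2, iter = 100)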

k

integer with the number of topics to identify

alpha

numeric, indicating the symmetric Dirichlet prior probability of a topic P(z). Defaults to 50/k.

beta

numeric, indicating the symmetric Dirichlet prior probability of a word given the topic P(w|z). Defaults to 0.01.

iter

integer with the number of iterations of Gibbs sampling

window

integer with the window size for biterm extraction. Defaults to 15.

background

logical; if set to TRUE, the first topic is set to a background topic equal to the empirical word distribution. This can be used to filter out common words. Defaults to FALSE.

trace

logical indicating whether to print out the evolution of the Gibbs sampling iterations. An integer can also be supplied, in which case progress is printed every trace iterations (e.g. trace = 10 in the examples below). Defaults to FALSE.

biterms

optionally, your own set of biterms to use for modelling.
This argument should be a data.frame with column names doc_id, term1, term2 and cooc, where cooc indicates how many times the biterm (the word pair term1-term2) occurs within that doc_id.
Note that doc_id values which do not occur in data are not allowed; the same holds for terms (in term1 and term2) which do not occur in data. See the examples and the sketch below.
If provided, the window argument is ignored and the data argument will only be used to calculate the background word frequency distribution.
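
A minimal sketch of how such a biterms data.frame could look (made-up values; all doc_id's and terms must also occur in data):

## hypothetical set of biterms: unordered word pairs with their counts per document
biterms <- data.frame(doc_id = c("d1", "d1", "d2"),
                      term1  = c("topic", "topic", "short"),
                      term2  = c("model", "text", "text"),
                      cooc   = c(2L, 1L, 1L),
                      stringsAsFactors = FALSE)
## model <- BTM(tokens, k = 2, biterms = biterms)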

detailed

logical indicating to return detailed output containing as well the vocabulary and the biterms used to construct the model. Defaults to FALSE.

Value

an object of class BTM which is a list containing the fitted model. Amongst its elements is phi, a matrix with the probability of each word given a topic P(w|z) (one row per vocabulary word, one column per topic, as used in the LDAvis example below). If detailed is set to TRUE, the list also contains the vocabulary (with elements token and freq) and the biterms used to construct the model.
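
A quick way to inspect these components, assuming a model fitted with detailed = TRUE as in the examples below:

str(model)              ## list all components of the fitted BTM object
dim(model$phi)          ## vocabulary size x number of topics
head(model$vocabulary)  ## token frequencies, available if detailed = TRUE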

Note

A biterm is defined as a pair of words co-occurring in the same text window. A biterm is an unordered word pair, so 'B C' equals 'C B'. Take as an example a document with the sequence of words 'A B C B' and assume the window size is set to 3. This implies there are two text windows which can generate biterms: text window 'A B C' with biterms 'A B', 'B C', 'A C' and text window 'B C B' with biterms 'B C', 'C B', 'B B'. Thus, the document 'A B C B' will have the following biterm frequencies:

  • 'A B': 1

  • 'B C': 3

  • 'A C': 1

  • 'B B': 1

These biterms are used to create the model.
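
As an illustrative sketch, the biterm frequencies above can be reproduced in a few lines of base R (the function biterm_freq below is hypothetical and not part of the package):

## hypothetical helper: count unordered word pairs over sliding text windows
biterm_freq <- function(words, window = 3) {
  pairs <- character()
  for (start in seq_len(max(length(words) - window + 1, 1))) {
    w   <- words[start:min(start + window - 1, length(words))]
    idx <- combn(seq_along(w), 2)
    pairs <- c(pairs, apply(idx, 2, function(i) paste(sort(w[i]), collapse = " ")))
  }
  table(pairs)
}
biterm_freq(c("A", "B", "C", "B"), window = 3)
## A B: 1, A C: 1, B B: 1, B C: 3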

References

Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model for Short Text. WWW 2013. https://github.com/xiaohuiyan/BTM, https://github.com/xiaohuiyan/xiaohuiyan.github.io/blob/master/paper/BTM-WWW13.pdf

See Also

predict.BTM, terms.BTM, logLik.BTM

Examples


## Fit a biterm topic model on the nouns found in Dutch reviews
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]
model  <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 10, trace = TRUE)
model
terms(model)                          ## top terms per topic
scores <- predict(model, newdata = x) ## topic probabilities per document

## Another small run with first topic the background word distribution
set.seed(123456)
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE)
model
terms(model)

##
## You can also provide your own set of biterms to cluster upon
## Example: cluster nouns and adjectives in the neighbourhood of one another
##
library(data.table)
library(udpipe)
x <- subset(brussels_reviews_anno, language == "nl")
x <- head(x, 5500) # take a sample to speed things up on CRAN
## Build biterms: co-occurrences of nouns/adjectives within a skipgram of 2, per document
biterms <- as.data.table(x)
biterms <- biterms[, cooccurrence(x = lemma, 
                                  relevant = xpos %in% c("NN", "NNP", "NNS", "JJ"),
                                  skipgram = 2), 
                   by = list(doc_id)]
head(biterms)
set.seed(123456)
x <- subset(x, xpos %in% c("NN", "NNP", "NNS", "JJ"))
x <- x[, c("doc_id", "lemma")]
model <- BTM(x, k = 5, beta = 0.01, iter = 10, background = TRUE, 
             biterms = biterms, trace = 10, detailed = TRUE)
model
terms(model)
bitermset <- terms(model, "biterms")
head(bitermset$biterms, 100)

bitermset$n        ## total number of biterms used to build the model
sum(biterms$cooc)  ## compare with the total of the provided co-occurrence counts


## Not run: 
##
## Visualisation either using the textplot or the LDAvis package
##
library(textplot)
library(ggraph)
library(concaveman)
plot(model, top_n = 4)

library(LDAvis)
docsize <- table(x$doc_id)
scores  <- predict(model, x)
scores  <- scores[names(docsize), ]
json <- createJSON(
  phi = t(model$phi), 
  theta = scores, 
  doc.length = as.integer(docsize),
  vocab = model$vocabulary$token, 
  term.frequency = model$vocabulary$freq)
serVis(json)

## End(Not run)

[Package BTM version 0.3.7 Index]