R: Distributed Representations of Topics

top2vec {doc2vec}

R Documentation

Distributed Representations of Topics

Description

Cluster text documents into topics using semantic embeddings of documents and words, so that each topic groups documents which are semantically similar.

Usage

top2vec(
  x,
  data = data.frame(doc_id = character(), text = character(), stringsAsFactors = FALSE),
  control.umap = list(n_neighbors = 15L, n_components = 5L, metric = "cosine"),
  control.dbscan = list(minPts = 100L),
  control.doc2vec = list(),
  umap = uwot::umap,
  trace = FALSE,
  ...
)

Arguments

`x`	one of the following: an object returned by `paragraph2vec`; a data.frame with columns 'doc_id' and 'text' storing document ids and texts as character vectors; a matrix with document embeddings to cluster; or a list with elements 'docs' and 'words' containing the document embeddings to cluster and the word embeddings used to derive topic summaries
`data`	optionally, a data.frame with columns 'doc_id' and 'text' representing documents. This dataset is only stored, so that the text of the documents most similar to a topic can be extracted. If it also contains a column 'text_doc2vec', that column is used to identify the most relevant topic words by class-based tf-idf
`control.umap`	a list of arguments to pass on to `umap` for reducing the dimensionality of the embedding space
`control.dbscan`	a list of arguments to pass on to `hdbscan` for clustering the reduced embedding space
`control.doc2vec`	optionally, a list of arguments to pass on to `paragraph2vec` in case `x` is a data.frame instead of a doc2vec model trained by `paragraph2vec`
`umap`	function used to apply UMAP. Defaults to `uwot::umap`; `uwot::tumap` can be used as well
`trace`	logical indicating whether to print the progress of the algorithm
`...`	further arguments not used yet
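
As an illustration of what the two control lists are for, the following sketch reduces a toy embedding matrix with `uwot::umap` and clusters the result with `dbscan::hdbscan`, splicing each list into the corresponding call. This is not the internal code of `top2vec`; the toy data, the variable names and the use of `do.call` are assumptions made for the example.

```r
library(uwot)
library(dbscan)

set.seed(42)
# Toy "document embeddings" (made up): two groups of 150 points in 20 dimensions
emb <- rbind(matrix(rnorm(150 * 20, mean =  2), ncol = 20),
             matrix(rnorm(150 * 20, mean = -2), ncol = 20))

control.umap   <- list(n_neighbors = 15L, n_components = 5L, metric = "cosine")
control.dbscan <- list(minPts = 20L)

# Each control list is passed on as arguments of the corresponding function
reduced  <- do.call(uwot::umap, c(list(X = emb), control.umap))
clusters <- do.call(dbscan::hdbscan, c(list(x = reduced), control.dbscan))

dim(reduced)             # documents x n_components
table(clusters$cluster)  # cluster sizes, label 0 marks noise points
```

Any argument accepted by the UMAP function or by `hdbscan` can be supplied in the same way through the respective list.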

Value

an object of class top2vec, which is a list with the following elements:

embedding: a list of matrices with word and document embeddings
doc2vec: a doc2vec model
umap: a matrix with the UMAP-reduced representations of the documents of x
dbscan: the result of the hdbscan clustering
data: a data.frame with columns doc_id and text
size: a vector of frequency statistics of topic occurrence
k: the number of clusters
control: a list of control arguments to doc2vec / umap / dbscan
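
A minimal sketch of how the returned list might be inspected, reading only elements listed above. The helper `describe_top2vec` is hypothetical and not part of doc2vec; it assumes that `k` is a single number and that `data` holds one row per document.

```r
# Hypothetical helper (not part of doc2vec): collect a few documented
# elements of a fitted top2vec object into one list.
describe_top2vec <- function(model) {
  stopifnot(inherits(model, "top2vec"))
  list(
    n_topics    = model$k,           # number of clusters
    n_documents = nrow(model$data),  # rows of the stored data.frame
    topic_sizes = model$size         # frequency statistics per topic
  )
}
```

It would be called as `describe_top2vec(model)` on the result of `top2vec`.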

Note

Topic '0' is the noise topic: it collects the documents which hdbscan did not assign to any cluster.
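
The sketch below shows one way to set the noise topic aside when counting topic sizes. The label vector is made up for illustration and follows the `dbscan::hdbscan` convention in which `0` marks noise points.

```r
# Toy cluster labels in the style of dbscan::hdbscan(), where 0 marks noise.
# The values are invented for this example.
labels <- c(0L, 1L, 1L, 2L, 0L, 2L, 2L, 1L)

is_noise    <- labels == 0L
noise_share <- mean(is_noise)            # fraction of documents without a topic
topic_sizes <- table(labels[!is_noise])  # documents per topic, noise excluded

noise_share
topic_sizes
```

For a fitted model the labels would presumably be taken from `model$dbscan$cluster`, assuming the hdbscan result is stored there unchanged.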

References

Angelov, D. (2020). Top2Vec: Distributed Representations of Topics. https://arxiv.org/abs/2008.09470

Examples



library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x      <- data.frame(doc_id = be_parliament_2020$doc_id,
                     text   = be_parliament_2020$text_nl,
                     stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x      <- subset(x, txt_count_words(text) < 1000)
d2v    <- paragraph2vec(x, type = "PV-DBOW", dim = 50, 
                        lr = 0.05, iter = 10,
                        window = 15, hs = TRUE, negative = 0,
                        sample = 0.00001, min_count = 5, 
                        threads = 1)
# write.paragraph2vec(d2v, "d2v.bin")
# d2v    <- read.paragraph2vec("d2v.bin")
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 4), trace = TRUE)
model  <- top2vec(d2v, data = x,
                  control.dbscan = list(minPts = 50), 
                  control.umap = list(n_neighbors = 15L, n_components = 3), umap = tumap, 
                  trace = TRUE)
                                  
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs
library(udpipe)
info   <- summary(model, top_n = 7, type = "c-tfidf")
info$topwords

## Change the model: reduce doc2vec model to 2D
model  <- update(model, type = "umap", 
                 n_neighbors = 100, n_components = 2, metric = "cosine", umap = tumap, 
                 trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs

## Change the model: have minimum 200 points for the core elements in the hdbscan density
model  <- update(model, type = "hdbscan", minPts = 200, trace = TRUE)
info   <- summary(model, top_n = 7)
info$topwords
info$topdocs



##
## Example on a small sample 
##  with unrealistic hyperparameter settings especially regarding dim / iter / n_epochs
##  in order to have a basic example finishing < 5 secs
##

library(doc2vec)
library(uwot)
library(dbscan)
library(word2vec)
data(be_parliament_2020, package = "doc2vec")
x        <- data.frame(doc_id = be_parliament_2020$doc_id,
                       text   = be_parliament_2020$text_nl,
                       stringsAsFactors = FALSE)
x        <- head(x, 1000)
x$text   <- txt_clean_word2vec(x$text)
x        <- subset(x, txt_count_words(text) < 1000)
d2v      <- paragraph2vec(x, type = "PV-DBOW", dim = 10, 
                          lr = 0.05, iter = 0,
                          window = 5, hs = TRUE, negative = 0,
                          sample = 0.00001, min_count = 5)
emb      <- list(docs  = as.matrix(d2v, which = "docs"),
                 words = as.matrix(d2v, which = "words"))
model    <- top2vec(emb, 
                    data = x,
                    control.dbscan = list(minPts = 50), 
                    control.umap = list(n_neighbors = 15, n_components = 2, 
                                        init = "spectral"), 
                    umap = tumap, trace = TRUE)
info     <- summary(model, top_n = 7)
print(info, top_n = c(5, 2))

[Package doc2vec version 0.2.0 Index]