top2vec {doc2vec} | R Documentation |
Distributed Representations of Topics
Description
Cluster text documents using semantic embeddings of documents and words, in order to find topics made up of semantically similar documents.
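As a rough illustration of the idea, the base R sketch below summarises topics once documents have been assigned to clusters: each topic vector is taken as the mean of the document embeddings in its cluster, and the word whose embedding has the highest cosine similarity to that vector describes the topic. This follows the general approach of the paper listed under References; the toy matrices, the cluster labels and the helper cosine_sim are made up for illustration and are not part of the package.

```r
# Toy embeddings (hypothetical values): 4 documents and 3 words in 2 dimensions
docs <- rbind(d1 = c(1.0, 0.1),
              d2 = c(0.9, 0.2),
              d3 = c(0.1, 1.0),
              d4 = c(0.2, 0.9))
words <- rbind(budget  = c(1, 0),
               health  = c(0, 1),
               general = c(1, 1))

# Hypothetical cluster labels, one per document
cluster <- c(1, 1, 2, 2)

# Cosine similarity between each row of a and each row of b
cosine_sim <- function(a, b) {
  a <- a / sqrt(rowSums(a^2))
  b <- b / sqrt(rowSums(b^2))
  a %*% t(b)
}

# Topic vector = centroid of the document embeddings in each cluster
topics <- rowsum(docs, group = cluster) / as.vector(table(cluster))

# For each topic, pick the word with the highest cosine similarity
sim     <- cosine_sim(topics, words)
topword <- colnames(sim)[apply(sim, 1, which.max)]
print(topword)
```

In the package itself, the clustering is done on a UMAP reduction of the document embeddings rather than on hand-assigned labels as in this sketch.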
Usage
top2vec(
  x,
  data = data.frame(doc_id = character(), text = character(), stringsAsFactors = FALSE),
  control.umap = list(n_neighbors = 15L, n_components = 5L, metric = "cosine"),
  control.dbscan = list(minPts = 100L),
  control.doc2vec = list(),
  umap = uwot::umap,
  trace = FALSE,
  ...
)
Arguments
x |
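either an object returned by paragraph2vec, or a data.frame with columns 'doc_id' and 'text' storing document ids and texts as character vectors, or a matrix with document embeddings to cluster, or a list with elements docs and words containing the document embeddings to cluster and the word embeddings used for deriving topic summaries |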
data |
optionally, a data.frame with columns 'doc_id' and 'text' representing documents. This dataset is only stored, so that the text of the documents most similar to a topic can be extracted. If it also contains a field 'text_doc2vec', that field is used to find the most relevant topic words by class-based tf-idf (c-tfidf). |
control.umap |
a list of arguments to pass on to the UMAP function given in argument umap |
control.dbscan |
a list of arguments to pass on to hdbscan (package dbscan) |
control.doc2vec |
optionally, a list of arguments to pass on to paragraph2vec in case x is a data.frame instead of a trained doc2vec model |
umap |
function to apply UMAP. Defaults to uwot::umap; uwot::tumap can be used as well, as in the Examples |
trace |
logical indicating whether to print the progress of the algorithm |
... |
further arguments not used yet |
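The control.* arguments are plain named lists whose elements are handed to the underlying function as arguments. As an illustration of that pattern only (this is not the package's actual code), the base R sketch below forwards such a list with do.call; reduce_dim is a hypothetical stand-in for a UMAP function and simply echoes what it received.

```r
# Hypothetical stand-in for a dimensionality reduction function such as a UMAP implementation
reduce_dim <- function(X, n_neighbors = 15L, n_components = 2L, metric = "euclidean") {
  list(n_neighbors = n_neighbors,
       n_components = n_components,
       metric = metric,
       dim = c(nrow(X), n_components))
}

# A control list in the same shape as control.umap
control <- list(n_neighbors = 15L, n_components = 4L, metric = "cosine")

# Toy embedding matrix: 10 documents in 4 dimensions (random values)
X <- matrix(rnorm(40), nrow = 10)

# Forward the matrix plus every element of the control list as named arguments
out <- do.call(reduce_dim, c(list(X), control))
str(out)
```

Any argument left out of the list falls back to the default of the receiving function, which is worth keeping in mind when passing only a partial list.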
Value
an object of class top2vec, which is a list with elements
embedding: a list of matrices with word and document embeddings
doc2vec: a doc2vec model
umap: a matrix of representations of the documents of x
dbscan: the result of the hdbscan clustering
data: a data.frame with columns doc_id and text
size: a vector of frequency statistics of topic occurrence
k: the number of clusters
control: a list of control arguments to doc2vec / umap / dbscan
Note
Topic '0' is the noise topic: it collects the documents that hdbscan did not assign to any cluster.
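Because label 0 marks noise rather than a real topic, it is usually worth separating it out when tabulating cluster assignments. A minimal base R sketch on a made-up label vector (the labels are hypothetical, not output of the package):

```r
# Hypothetical cluster labels for 10 documents; 0 denotes noise
cluster <- c(0, 1, 1, 2, 0, 2, 2, 1, 0, 2)

# Frequency of each label, including the noise label
size <- table(cluster)
print(size)

# Keep only the real topics by dropping the noise label
topics_only <- size[names(size) != "0"]
n_topics    <- length(topics_only)

# Share of documents that ended up as noise
noise_share <- mean(cluster == 0)
```

A large noise share can be a hint that minPts is set too high for the number of documents at hand.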
References
https://arxiv.org/abs/2008.09470
See Also
Examples
library(doc2vec)
library(word2vec)
library(uwot)
library(dbscan)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text = be_parliament_2020$text_nl,
                stringsAsFactors = FALSE)
x$text <- txt_clean_word2vec(x$text)
x <- subset(x, txt_count_words(text) < 1000)
d2v <- paragraph2vec(x, type = "PV-DBOW", dim = 50,
                     lr = 0.05, iter = 10,
                     window = 15, hs = TRUE, negative = 0,
                     sample = 0.00001, min_count = 5,
                     threads = 1)
# write.paragraph2vec(d2v, "d2v.bin")
# d2v <- read.paragraph2vec("d2v.bin")
model <- top2vec(d2v, data = x,
                 control.dbscan = list(minPts = 50),
                 control.umap = list(n_neighbors = 15L, n_components = 4),
                 trace = TRUE)
model <- top2vec(d2v, data = x,
                 control.dbscan = list(minPts = 50),
                 control.umap = list(n_neighbors = 15L, n_components = 3),
                 umap = tumap,
                 trace = TRUE)
info <- summary(model, top_n = 7)
info$topwords
info$topdocs
library(udpipe)
info <- summary(model, top_n = 7, type = "c-tfidf")
info$topwords
## Change the model: reduce doc2vec model to 2D
model <- update(model, type = "umap",
                n_neighbors = 100, n_components = 2, metric = "cosine",
                umap = tumap, trace = TRUE)
info <- summary(model, top_n = 7)
data = x
info$topwords
info$topdocs
## Change the model: have minimum 200 points for the core elements in the hdbscan density
model <- update(model, type = "hdbscan", minPts = 200, trace = TRUE)
info <- summary(model, top_n = 7)
data = x
info$topwords
info$topdocs
##
## Example on a small sample
## with unrealistic hyperparameter settings especially regarding dim / iter / n_epochs
## in order to have a basic example finishing < 5 secs
##
library(doc2vec)
library(uwot)
library(dbscan)
library(word2vec)
data(be_parliament_2020, package = "doc2vec")
x <- data.frame(doc_id = be_parliament_2020$doc_id,
                text = be_parliament_2020$text_nl,
                stringsAsFactors = FALSE)
x <- head(x, 1000)
x$text <- txt_clean_word2vec(x$text)
x <- subset(x, txt_count_words(text) < 1000)
d2v <- paragraph2vec(x, type = "PV-DBOW", dim = 10,
                     lr = 0.05, iter = 0,
                     window = 5, hs = TRUE, negative = 0,
                     sample = 0.00001, min_count = 5)
emb <- list(docs = as.matrix(d2v, which = "docs"),
            words = as.matrix(d2v, which = "words"))
model <- top2vec(emb,
                 data = x,
                 control.dbscan = list(minPts = 50),
                 control.umap = list(n_neighbors = 15, n_components = 2,
                                     init = "spectral"),
                 umap = tumap, trace = TRUE)
info <- summary(model, top_n = 7)
print(info, top_n = c(5, 2))