paragraph2vec {doc2vec}    R Documentation
Train a paragraph2vec model, also known as doc2vec, on text
Description
Construct a paragraph2vec model on text.
The algorithm is explained at https://arxiv.org/pdf/1405.4053.pdf.
People also refer to this model as doc2vec.
The model is an extension of the word2vec algorithm, in which an additional vector for every paragraph is added directly during training.
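A minimal sketch of such a call (the doc_id and text values below are toy data made up for illustration, and the small dim/iter settings are only meant to show the interface):
library(doc2vec)
## toy input: a data.frame with the required doc_id and text columns
toy <- data.frame(doc_id = c("doc_1", "doc_2"),
                  text   = c("brussels is the capital of belgium",
                             "the parliament met in brussels"),
                  stringsAsFactors = FALSE)
## train a tiny PV-DBOW model on the two toy documents
m <- paragraph2vec(x = toy, type = "PV-DBOW", dim = 10, iter = 3, min_count = 1)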
Usage
paragraph2vec(
x,
type = c("PV-DBOW", "PV-DM"),
dim = 50,
window = ifelse(type == "PV-DM", 5L, 10L),
iter = 5L,
lr = 0.05,
hs = FALSE,
negative = 5L,
sample = 0.001,
min_count = 5L,
threads = 1L,
encoding = "UTF-8",
embeddings = matrix(nrow = 0, ncol = dim),
...
)
Arguments
x: a data.frame with columns doc_id and text, or the path to a file on disk containing the training data.
type: character string with the type of algorithm to use, either 'PV-DBOW' (Distributed Bag Of Words) or 'PV-DM' (Distributed Memory). Defaults to 'PV-DBOW'.
dim: dimension of the word and paragraph vectors. Defaults to 50.
window: skip length between words. Defaults to 5 for PV-DM and 10 for PV-DBOW, as in the Usage defaults above.
iter: number of training iterations. Defaults to 5.
lr: initial learning rate, also known as alpha. Defaults to 0.05.
hs: logical indicating to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning negative sampling is used.
negative: integer with the number of negative samples. Only used in case hs is set to FALSE.
sample: threshold for the occurrence of words; words appearing with higher frequency in the training data are randomly down-sampled. Defaults to 0.001.
min_count: integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.
threads: number of CPU threads to use. Defaults to 1.
encoding: the encoding of the text data in x. Defaults to 'UTF-8'.
embeddings: optionally, a matrix with pretrained word embeddings which will be used to initialise the word embedding space (transfer learning). The row names of this matrix should be words and the number of columns should equal dim. Only words overlapping with the vocabulary extracted from x are used; a short sketch of this format follows the argument list.
...: further arguments passed on to the underlying C++ training function.
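As a small sketch of the embeddings format just described (the words and vector values below are invented purely for illustration), a pretrained matrix has one row per word and as many columns as dim:
## hypothetical pretrained word vectors: row names are words, ncol matches dim (here 50)
pretrained <- matrix(rnorm(3 * 50), nrow = 3, ncol = 50,
                     dimnames = list(c("le", "la", "les"), NULL))
## this matrix would then be passed along as
## paragraph2vec(x = x, type = "PV-DM", dim = 50, embeddings = pretrained)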
Value
an object of class paragraph2vec_trained, which is a list with elements:
model: an Rcpp pointer to the model
data: a list with elements file (the training data used), n (the number of words in the training data), n_vocabulary (the number of words in the vocabulary) and n_docs (the number of documents)
control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample
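For instance, on a model trained as in the Examples below, these elements can be inspected directly:
model$data$n_docs        ## number of documents used during training
model$data$n_vocabulary  ## number of words in the vocabulary
model$control$dim        ## the embedding dimension that was requested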
References
https://arxiv.org/pdf/1405.4053.pdf, https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m/J6KG8mUj45sJ
See Also
predict.paragraph2vec, as.matrix.paragraph2vec
Examples
library(tokenizers.bpe)
## Take data and standardise it a bit
data(belgium_parliament, package = "tokenizers.bpe")
str(belgium_parliament)
x <- subset(belgium_parliament, language %in% "french")
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
x$nwords <- txt_count_words(x$text)
x <- subset(x, nwords < 1000 & nchar(text) > 0)
## Build the model
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
str(model)
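## Get the embeddings of the words and of the documents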
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
head(embedding)
## Get vocabulary
vocab <- summary(model, type = "vocabulary", which = "docs")
vocab <- summary(model, type = "vocabulary", which = "words")
## Transfer learning using existing word embeddings
library(word2vec)
w2v <- word2vec(x$text, dim = 50, type = "cbow", iter = 20, min_count = 5)
emb <- as.matrix(w2v)
model <- paragraph2vec(x = x, dim = 50, type = "PV-DM", iter = 20, min_count = 5,
embeddings = emb)
## Transfer learning - proof of concept without learning (iter=0, set to higher to learn)
emb <- matrix(rnorm(30), nrow = 2, dimnames = list(c("en", "met")))
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 0, embeddings = emb)
embedding <- as.matrix(model, which = "words", normalize = FALSE)
embedding[c("en", "met"), ]
emb
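## Sketch (assumption, not from this page): querying the model with predict.paragraph2vec,
## listed under See Also. The newdata/type/which/top_n argument names used here are
## assumptions about that method rather than something documented above.
nn <- predict(model, newdata = x$doc_id[1], type = "nearest", which = "doc2doc", top_n = 5)
nn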