paragraph2vec {doc2vec}    R Documentation
Train a paragraph2vec model, also known as doc2vec, on text
Description
Construct a paragraph2vec model on text.
The algorithm is explained at https://arxiv.org/pdf/1405.4053.pdf.
People also refer to this model as doc2vec.
The model is an extension of the word2vec algorithm, in which an additional vector for every paragraph is added directly during training.
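A minimal sketch of such a call (the doc_id and text values below are toy data made up for illustration, and the small dim/iter settings are only meant to show the interface):
library(doc2vec)
## toy input: a data.frame with the required doc_id and text columns
toy <- data.frame(doc_id = c("doc_1", "doc_2"),
                  text   = c("brussels is the capital of belgium",
                             "the parliament met in brussels"),
                  stringsAsFactors = FALSE)
## train a tiny PV-DBOW model on the two toy documents
m <- paragraph2vec(x = toy, type = "PV-DBOW", dim = 10, iter = 3, min_count = 1)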
Usage
paragraph2vec(
x,
type = c("PV-DBOW", "PV-DM"),
dim = 50,
window = ifelse(type == "PV-DM", 5L, 10L),
iter = 5L,
lr = 0.05,
hs = FALSE,
negative = 5L,
sample = 0.001,
min_count = 5L,
threads = 1L,
encoding = "UTF-8",
embeddings = matrix(nrow = 0, ncol = dim),
...
)
Arguments
x: a data.frame with columns doc_id and text, or the path to a file on disk containing the training data.
type: character string with the type of algorithm to use, either 'PV-DBOW' (Distributed Bag Of Words) or 'PV-DM' (Distributed Memory). Defaults to 'PV-DBOW'.
dim: dimension of the word and paragraph vectors. Defaults to 50.
window: skip length between words. Defaults to 5 for PV-DM and 10 for PV-DBOW, as in the Usage defaults above.
iter: number of training iterations. Defaults to 5.
lr: initial learning rate, also known as alpha. Defaults to 0.05.
hs: logical indicating to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning negative sampling is used.
negative: integer with the number of negative samples. Only used in case hs is set to FALSE.
sample: threshold for the occurrence of words; words appearing with higher frequency in the training data are randomly down-sampled. Defaults to 0.001.
min_count: integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.
threads: number of CPU threads to use. Defaults to 1.
encoding: the encoding of the text data in x. Defaults to 'UTF-8'.
embeddings: optionally, a matrix with pretrained word embeddings which will be used to initialise the word embedding space (transfer learning). The row names of this matrix should be words and the number of columns should equal dim. Only words overlapping with the vocabulary extracted from x are used; a short sketch of this format follows the argument list.
...: further arguments passed on to the underlying C++ training function.
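As a small sketch of the embeddings format just described (the words and vector values below are invented purely for illustration), a pretrained matrix has one row per word and as many columns as dim:
## hypothetical pretrained word vectors: row names are words, ncol matches dim (here 50)
pretrained <- matrix(rnorm(3 * 50), nrow = 3, ncol = 50,
                     dimnames = list(c("le", "la", "les"), NULL))
## this matrix would then be passed along as
## paragraph2vec(x = x, type = "PV-DM", dim = 50, embeddings = pretrained)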
Value
an object of class paragraph2vec_trained, which is a list with elements:
model: an Rcpp pointer to the model
data: a list with elements file (the training data used), n (the number of words in the training data), n_vocabulary (the number of words in the vocabulary) and n_docs (the number of documents)
control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample
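For instance, on a model trained as in the Examples below, these elements can be inspected directly:
model$data$n_docs        ## number of documents used during training
model$data$n_vocabulary  ## number of words in the vocabulary
model$control$dim        ## the embedding dimension that was requested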
References
https://arxiv.org/pdf/1405.4053.pdf, https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m/J6KG8mUj45sJ
See Also
predict.paragraph2vec, as.matrix.paragraph2vec
Examples
library(tokenizers.bpe)
## Take data and standardise it a bit
data(belgium_parliament, package = "tokenizers.bpe")
str(belgium_parliament)
x <- subset(belgium_parliament, language %in% "french")
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
x$nwords <- txt_count_words(x$text)
x <- subset(x, nwords < 1000 & nchar(text) > 0)
## Build the model
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
str(model)
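## Get the embeddings of the words and of the documents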
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
head(embedding)
## Get vocabulary
vocab <- summary(model, type = "vocabulary", which = "docs")
vocab <- summary(model, type = "vocabulary", which = "words")
## Transfer learning using existing word embeddings
library(word2vec)
w2v <- word2vec(x$text, dim = 50, type = "cbow", iter = 20, min_count = 5)
emb <- as.matrix(w2v)
model <- paragraph2vec(x = x, dim = 50, type = "PV-DM", iter = 20, min_count = 5,
embeddings = emb)
## Transfer learning - proof of concept without learning (iter=0, set to higher to learn)
emb <- matrix(rnorm(30), nrow = 2, dimnames = list(c("en", "met")))
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 0, embeddings = emb)
embedding <- as.matrix(model, which = "words", normalize = FALSE)
embedding[c("en", "met"), ]
emb
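## Sketch (assumption, not from this page): querying the model with predict.paragraph2vec,
## listed under See Also. The newdata/type/which/top_n argument names used here are
## assumptions about that method rather than something documented above.
nn <- predict(model, newdata = x$doc_id[1], type = "nearest", which = "doc2doc", top_n = 5)
nn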