predict.paragraph2vec {doc2vec}    R Documentation
Predict functionalities for a paragraph2vec model
Description
Use the paragraph2vec model to
- get the embeddings of documents, sentences or words
- find the nearest documents/words which are similar to either a set of documents, words or a set of sentences containing words
Usage
## S3 method for class 'paragraph2vec'
predict(
object,
newdata,
type = c("embedding", "nearest"),
which = c("docs", "words", "doc2doc", "word2doc", "word2word", "sent2doc"),
top_n = 10L,
encoding = "UTF-8",
normalize = TRUE,
...
)
Arguments
object
a paragraph2vec model as returned by paragraph2vec or read.paragraph2vec
newdata
either a character vector of words, a character vector of doc_id's or a list of sentences where the list elements are words which are part of the model dictionary. What needs to be provided depends on the argument you provide in which
type
either 'embedding' or 'nearest' to get the embeddings or to find the closest text items. Defaults to 'embedding'.
which
either one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' where
- 'docs': get the embeddings of the documents provided in newdata
- 'words': get the embeddings of the words provided in newdata
- 'doc2doc': get the closest documents to the documents provided in newdata
- 'word2doc': get the closest documents to the words provided in newdata
- 'word2word': get the closest words to the words provided in newdata
- 'sent2doc': get the closest documents to the sentences provided in newdata
top_n
show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used for type 'nearest'.
encoding
set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.
normalize
logical indicating to normalize the embeddings. Defaults to TRUE. Only used for type 'embedding'.
...
not used
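How type and which combine is easiest to see side by side. A minimal sketch, assuming a fitted paragraph2vec model called model whose dictionary contains the words and doc_id's used below:

## Embeddings of known words and documents
predict(model, newdata = c("geld"),  type = "embedding", which = "words")
predict(model, newdata = c("doc_1"), type = "embedding", which = "docs")
## Nearest neighbours: documents or words close to documents, words or sentences
predict(model, newdata = c("doc_1"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld"),  type = "nearest", which = "word2doc")
predict(model, newdata = c("geld"),  type = "nearest", which = "word2word")
predict(model, newdata = list(s1 = c("geld", "koning")),
        type = "nearest", which = "sent2doc")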
Value
depending on the type, you get a different output:
for type nearest: returns a list of data.frames with columns term1, term2, similarity and rank indicating the elements which are closest to the provided
newdata
for type embedding: a matrix of embeddings of the words/documents or sentences provided in
newdata
, rownames are either taken from the words/documents or list names of the sentences. The matrix has always the same number of rows as the length ofnewdata
, possibly with NA values if the word/doc_id is not part of the dictionary
See the examples.
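A minimal sketch of these shapes, assuming a fitted model called model whose dictionary contains 'geld' but not 'unknownword':

emb <- predict(model, newdata = c("geld", "unknownword"),
               type = "embedding", which = "words")
dim(emb)                      ## 2 rows, one per element of newdata
anyNA(emb["unknownword", ])   ## TRUE: not part of the dictionary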
See Also
paragraph2vec, read.paragraph2vec
Examples
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
## Build the model (a small PV-DM model and a larger PV-DBOW model;
## only the last assignment is kept)
model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
sentences <- list(
example = c("geld", "diabetes"),
hi = c("geld", "diabetes", "koning"),
test = c("geld"),
nothing = character(),
repr = c("geld", "diabetes", "koning"))
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""),
type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"),
type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")
## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
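## Each 'nearest' call returns a list with one data.frame per input
## element; a quick way to inspect a single result (a sketch):
nn <- predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
str(nn)        ## a list with one data.frame per word
head(nn$geld)  ## columns term1, term2, similarity and rank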
## A similar way of extracting similarities
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, type = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
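## The embedding matrix can contain NA rows for out-of-dictionary items
## (here e.g. the empty 'nothing' sentence); a small sketch of dropping
## those rows before computing similarities:
emb <- emb[stats::complete.cases(emb), , drop = FALSE]
paragraph2vec_similarity(emb, emb_docs, top_n = 3)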