predict.paragraph2vec {doc2vec}    R Documentation

Predict functionalities for a paragraph2vec model

Description

Use the paragraph2vec model to get the embedding of documents, sentences or words, or to find the nearest documents or words which are similar to a set of documents, words or sentences.

Usage

## S3 method for class 'paragraph2vec'
predict(
  object,
  newdata,
  type = c("embedding", "nearest"),
  which = c("docs", "words", "doc2doc", "word2doc", "word2word", "sent2doc"),
  top_n = 10L,
  encoding = "UTF-8",
  normalize = TRUE,
  ...
)

Arguments

object

a paragraph2vec model as returned by paragraph2vec or read.paragraph2vec

newdata

either a character vector of words, a character vector of doc_id's, or a list of sentences where each list element is a character vector of words which are part of the model dictionary. What needs to be provided depends on the argument which. See the examples.

type

either 'embedding' or 'nearest' to get the embeddings or to find the closest text items. Defaults to 'embedding'.

which

either one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' where

  • 'docs' or 'words' can be chosen if type is set to 'embedding' to indicate that newdata contains either doc_id's or words

  • 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' can be chosen if type is set to 'nearest' to find, respectively, the documents closest to a document (doc2doc), the documents closest to a word (word2doc), the words closest to a word (word2word) or the documents closest to a set of sentences (sent2doc). A short sketch follows below.
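
A minimal sketch of how type, which and newdata combine, assuming the model and the data built in the Examples section below:

  predict(model, newdata = c("doc_1"), type = "embedding", which = "docs")      # embeddings of documents
  predict(model, newdata = c("geld"),  type = "embedding", which = "words")     # embeddings of words
  predict(model, newdata = c("doc_1"), type = "nearest",   which = "doc2doc")   # documents close to a document
  predict(model, newdata = c("geld"),  type = "nearest",   which = "word2doc")  # documents close to a word
  predict(model, newdata = c("geld"),  type = "nearest",   which = "word2word") # words close to a word
  predict(model, newdata = list(s1 = c("geld", "koning")),
          type = "nearest", which = "sent2doc")                                 # documents close to a sentence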

top_n

show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used for type 'nearest'.

encoding

set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.

normalize

logical indicating to normalize the embeddings. Defaults to TRUE. Only used for type 'embedding'.
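
With normalize set to TRUE, each returned embedding has unit L2 norm, so inner products between normalized embeddings correspond to cosine similarities. A minimal check, assuming the model built in the Examples section below:

  emb <- predict(model, newdata = c("geld", "koning"), type = "embedding",
                 which = "words", normalize = TRUE)
  sqrt(rowSums(emb^2))  # each known word should give a norm close to 1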

...

not used

Value

depending on the type, you get a different output:

  • for type 'embedding': a matrix of embeddings, with one row for each element of newdata

  • for type 'nearest': a list of data.frames, one per element of newdata, each containing the top_n most similar items together with their similarity score

See the examples.
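
A quick way to inspect both output structures, assuming the model built in the Examples section below:

  emb <- predict(model, newdata = c("geld"), type = "embedding", which = "words")
  str(emb)  # numeric matrix with one row and as many columns as the embedding dimension
  nn <- predict(model, newdata = c("geld"), type = "nearest", which = "word2word")
  str(nn)   # a list holding a data.frame of nearest words and their similarity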

See Also

paragraph2vec, read.paragraph2vec

Examples


library(tokenizers.bpe)
library(udpipe)  # provides txt_count_words() used below
data(belgium_parliament, package = "tokenizers.bpe")
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)

## Build model (PV-DM architecture)
model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
## or a larger PV-DBOW model; note this overwrites the PV-DM model above
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
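
## A fitted model can be stored and reloaded later. A minimal sketch using
## write.paragraph2vec and read.paragraph2vec from the doc2vec package;
## the tempfile path is purely illustrative.
path  <- tempfile(fileext = ".bin")
write.paragraph2vec(model, file = path)
model <- read.paragraph2vec(file = path)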


## Tokenised sentences; the empty element and the duplicate entry are deliberate edge cases
sentences <- list(
  example = c("geld", "diabetes"),
  hi = c("geld", "diabetes", "koning"),
  test = c("geld"),
  nothing = character(), 
  repr = c("geld", "diabetes", "koning"))
  
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""), 
               type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"), 
               type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")

## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)

## Alternative way of extracting similarities
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, type = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
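
## A sketch going from the most similar doc_id's back to the underlying texts;
## it assumes the nearest lookup returns a named list of data.frames whose
## term2 column holds the matched doc_id.
nn <- predict(model, newdata = c("geld"), type = "nearest", which = "word2doc")
nn <- nn[["geld"]]
merge(nn, x[, c("doc_id", "text")], by.x = "term2", by.y = "doc_id")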


[Package doc2vec version 0.2.0 Index]