predict.paragraph2vec {doc2vec}    R Documentation
Predict functionalities for a paragraph2vec model
Description
Use the paragraph2vec model to
- get the embeddings of documents, sentences or words
- find the nearest documents/words which are similar to either a set of documents, words or a set of sentences containing words
Usage
## S3 method for class 'paragraph2vec'
predict(
object,
newdata,
type = c("embedding", "nearest"),
which = c("docs", "words", "doc2doc", "word2doc", "word2word", "sent2doc"),
top_n = 10L,
encoding = "UTF-8",
normalize = TRUE,
...
)
Arguments
object
a paragraph2vec model as returned by paragraph2vec or read.paragraph2vec
newdata
either a character vector of words, a character vector of doc_id's or a list of sentences where the list elements are words which are part of the model dictionary. What needs to be provided depends on the argument you provide in which
type
either 'embedding' or 'nearest' to get the embeddings or to find the closest text items. Defaults to 'embedding'.
which
either one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' where
- 'docs': get the embeddings of the documents provided in newdata
- 'words': get the embeddings of the words provided in newdata
- 'doc2doc': get the closest documents to the documents provided in newdata
- 'word2doc': get the closest documents to the words provided in newdata
- 'word2word': get the closest words to the words provided in newdata
- 'sent2doc': get the closest documents to the sentences provided in newdata
top_n
show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used for type 'nearest'.
encoding
set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.
normalize
logical indicating to normalize the embeddings. Defaults to TRUE. Only used for type 'embedding'.
...
not used
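How type and which combine is easiest to see side by side. A minimal sketch, assuming a fitted paragraph2vec model called model whose dictionary contains the words and doc_id's used below:

## Embeddings of known words and documents
predict(model, newdata = c("geld"),  type = "embedding", which = "words")
predict(model, newdata = c("doc_1"), type = "embedding", which = "docs")
## Nearest neighbours: documents or words close to documents, words or sentences
predict(model, newdata = c("doc_1"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld"),  type = "nearest", which = "word2doc")
predict(model, newdata = c("geld"),  type = "nearest", which = "word2word")
predict(model, newdata = list(s1 = c("geld", "koning")),
        type = "nearest", which = "sent2doc")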
Value
depending on the type, you get a different output:
for type nearest: returns a list of data.frames with columns term1, term2, similarity and rank indicating the elements which are closest to the provided
newdata
for type embedding: a matrix of embeddings of the words/documents or sentences provided in
newdata
, rownames are either taken from the words/documents or list names of the sentences. The matrix has always the same number of rows as the length ofnewdata
, possibly with NA values if the word/doc_id is not part of the dictionary
See the examples.
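A minimal sketch of these shapes, assuming a fitted model called model whose dictionary contains 'geld' but not 'unknownword':

emb <- predict(model, newdata = c("geld", "unknownword"),
               type = "embedding", which = "words")
dim(emb)                      ## 2 rows, one per element of newdata
anyNA(emb["unknownword", ])   ## TRUE: not part of the dictionary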
See Also
paragraph2vec, read.paragraph2vec
Examples
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
## Build the model (a small PV-DM model and a larger PV-DBOW model;
## only the last assignment is kept)
model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
sentences <- list(
example = c("geld", "diabetes"),
hi = c("geld", "diabetes", "koning"),
test = c("geld"),
nothing = character(),
repr = c("geld", "diabetes", "koning"))
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""),
type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"),
type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")
## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
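## Each 'nearest' call returns a list with one data.frame per input
## element; a quick way to inspect a single result (a sketch):
nn <- predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
str(nn)        ## a list with one data.frame per word
head(nn$geld)  ## columns term1, term2, similarity and rank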
## A similar way of extracting similarities
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, type = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
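## The embedding matrix can contain NA rows for out-of-dictionary items
## (here e.g. the empty 'nothing' sentence); a small sketch of dropping
## those rows before computing similarities:
emb <- emb[stats::complete.cases(emb), , drop = FALSE]
paragraph2vec_similarity(emb, emb_docs, top_n = 3)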