| predict.paragraph2vec {doc2vec} | R Documentation | 
Predict functionalities for a paragraph2vec model
Description
Use the paragraph2vec model to
- get the embedding of documents, sentences or words 
- find the documents/words which are closest to a set of documents, a set of words, or a set of sentences containing words 
Usage
## S3 method for class 'paragraph2vec'
predict(
  object,
  newdata,
  type = c("embedding", "nearest"),
  which = c("docs", "words", "doc2doc", "word2doc", "word2word", "sent2doc"),
  top_n = 10L,
  encoding = "UTF-8",
  normalize = TRUE,
  ...
)
Arguments
| object | a paragraph2vec model as returned by paragraph2vec or read.paragraph2vec | 
| newdata | either a character vector of words, a character vector of doc_id's, or a list of sentences
where the list elements are words that are part of the model dictionary. What needs to be provided depends on the argument you provide in which. | 
| type | either 'embedding' or 'nearest', to get the embeddings or to find the closest text items. Defaults to 'embedding'. | 
| which | either one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc', where 
- 'docs': get the embeddings of the documents provided in newdata 
- 'words': get the embeddings of the words provided in newdata 
- 'doc2doc': get the documents closest to the documents provided in newdata 
- 'word2doc': get the documents closest to the words provided in newdata 
- 'word2word': get the words closest to the words provided in newdata 
- 'sent2doc': get the documents closest to the sentences provided in newdata 
 | 
| top_n | show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used in case type is 'nearest'. | 
| encoding | set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'. | 
| normalize | logical indicating whether to normalize the embeddings. Defaults to TRUE. | 
| ... | not used | 
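Note how type and which pair up: 'docs' and 'words' are used with type = 'embedding', while 'doc2doc', 'word2doc', 'word2word' and 'sent2doc' are used with type = 'nearest'. A minimal sketch, assuming a model fitted as in the Examples below:

predict(model, newdata = "geld", type = "embedding", which = "words")
predict(model, newdata = "geld", type = "nearest",   which = "word2word")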
Value
Depending on the type, you get a different output:
- for type 'nearest': a list of data.frames with columns term1, term2, similarity and rank, indicating the elements which are closest to the provided newdata
- for type 'embedding': a matrix of embeddings of the words/documents or sentences provided in newdata; rownames are taken from the words/documents or from the list names of the sentences. The matrix always has the same number of rows as the length of newdata, possibly with NA values if a word/doc_id is not part of the dictionary
See the examples.
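A quick sketch of these output shapes, assuming a model fitted as in the Examples below:

nn <- predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
str(nn)   # a list of data.frames with columns term1, term2, similarity and rank
emb <- predict(model, newdata = c("geld", "unknownword"), type = "embedding", which = "words")
dim(emb)  # one row per element of newdata; the row for 'unknownword' contains NA values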
See Also
paragraph2vec, read.paragraph2vec
Examples
library(doc2vec)
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
## keep non-empty documents of fewer than 1000 words
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
## clean the text: lowercase, keep letters only, collapse whitespace
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)
## Build the model (the second call overwrites the first: the PV-DBOW fit is kept)
model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
sentences <- list(
  example = c("geld", "diabetes"),
  hi = c("geld", "diabetes", "koning"),
  test = c("geld"),
  nothing = character(), 
  repr = c("geld", "diabetes", "koning"))
  
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""), 
               type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"), 
               type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")
## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
## Alternatively, extract the embeddings yourself and compute similarities
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, which = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
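
## The same pattern works at the word level (a sketch, assuming as.matrix
## also accepts which = "words" analogous to which = "docs" above)
emb_words <- as.matrix(model, which = "words")
paragraph2vec_similarity(emb_words[c("geld", "koning"), ], emb_words, top_n = 5)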