doc2vec {word2vec} | R Documentation |
Get document vectors based on a word2vec model
Description
A document vector is the sum of the vectors of the words in the document, standardised by the scale of the vector space. This scale is the square root of the average inner product of the vector elements.
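The computation described above can be sketched in base R. This is one reading of the description, not the package's internal code: the toy `embedding` matrix and the helper `doc_vector` are hypothetical names introduced for illustration, and returning `NA` for a document without known words is a choice made in this sketch.

```r
# Toy embedding matrix (hypothetical values), one row per word in the dictionary
embedding <- rbind(
  bus    = c(1, 2, 2),
  toilet = c(3, 0, 4),
  no     = c(0, 1, 0)
)

# Hypothetical helper: sum the word vectors of a document, then divide by
# the square root of the mean squared element of that sum
doc_vector <- function(tokens, embedding) {
  # keep only tokens found in the dictionary; repeated words count repeatedly
  known <- tokens[tokens %in% rownames(embedding)]
  if (length(known) == 0) {
    # assumption of this sketch: no known words gives a vector of NA
    return(rep(NA_real_, ncol(embedding)))
  }
  total <- colSums(embedding[known, , drop = FALSE])
  scale <- sqrt(mean(total^2))
  total / scale
}

v <- doc_vector(c("bus", "toilet", "unknown"), embedding)
v
```

After standardisation the mean of the squared elements of `v` is 1, so documents of different lengths end up on a comparable scale.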
Usage
doc2vec(object, newdata, split = " ", encoding = "UTF-8", ...)
Arguments
object |
a word2vec model as returned by |
newdata |
one of the following: a list of tokens, where each list element is a character vector of the tokens which form the document and the list names are used as the document identifiers; a data.frame with columns doc_id and text; or a character vector of texts, where the names of the character vector are used as the document identifiers |
split |
in case |
encoding |
set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'. |
... |
not used |
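The three accepted shapes of newdata carry the same information. As a rough illustration, the base R sketch below reduces each of them to a named list of tokens. The helper `as_token_list` is hypothetical and not part of the word2vec package, and splitting with `strsplit` is an assumption made here, mirroring the Examples.

```r
# Hypothetical helper, for illustration only (not the package's internal code)
as_token_list <- function(newdata, split = " ") {
  # a data.frame is also a list, so it has to be checked first
  if (is.data.frame(newdata)) {
    texts <- as.character(newdata$text)
    names(texts) <- newdata$doc_id
    newdata <- texts
  }
  # a character vector of texts is split into tokens; names are kept
  if (is.character(newdata)) {
    newdata <- strsplit(newdata, split = split)
  }
  # a list of tokens is returned unchanged
  newdata
}

df  <- data.frame(doc_id = c("a", "b"), text = c("no toilet", NA),
                  stringsAsFactors = FALSE)
chr <- c(a = "no toilet", b = NA)
lst <- list(a = c("no", "toilet"), b = NA_character_)

str(as_token_list(df))
str(as_token_list(chr))
str(as_token_list(lst))
```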
Value
a matrix with one row per document, containing the document vectors. The rownames of this matrix are the document identifiers.
See Also
Examples
path <- system.file(package = "word2vec", "models", "example.bin")
model <- read.word2vec(path)
x <- data.frame(doc_id = c("doc1", "doc2", "testmissingdata"),
                text = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
                stringsAsFactors = FALSE)
emb <- doc2vec(model, x, type = "embedding")
emb
newdoc <- doc2vec(model, "i like busses with a toilet")
word2vec_similarity(emb, newdoc)
## similar way of extracting embeddings, from a named character vector
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
              nm = c("a", "b", "c"))
emb <- doc2vec(model, x, type = "embedding")
emb
## similar way of extracting embeddings, from a list of tokens
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
              nm = c("a", "b", "c"))
x <- strsplit(x, "[ .]")
emb <- doc2vec(model, x, type = "embedding")
emb
## show behaviour in case of NA or character data of no length
x <- list(a = character(), b = c("bus", "toilet"), c = NA)
emb <- doc2vec(model, x, type = "embedding")
emb