starspace_embedding {ruimtehol}    R Documentation
Get the document or ngram embeddings
Description
Get the document or ngram embeddings
Usage
starspace_embedding(object, x, type = c("document", "ngram"))
Arguments
object: an object of class textspace as returned by starspace or starspace_load_model
x: a character vector of text for which to get the embeddings
type: the type of embedding requested, either 'document' or 'ngram'. For 'document', the function returns the document embedding; for 'ngram', it returns the embedding of the provided ngram term. See the Details section.
Details
Document embeddings look at the features (e.g. words) present in x and sum the embeddings of these features to obtain a document embedding. This summed embedding is divided by size^p when dot similarity is used and by the Euclidean norm when cosine similarity is used, where size is the number of features (e.g. words) in x. If p = 1, this is equivalent to taking the average of the embeddings; if p = 0, it is equivalent to taking the sum of the embeddings. You set p and similarity in starspace when you train the model.
For ngram embeddings, Starspace uses a hashing trick to find the bucket in which the ngram lies and then retrieves the embedding of that bucket. Note that if you specify type = 'ngram', you need to make sure x contains fewer features (e.g. words) than the value of ngram that you set when training your model with starspace.
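The normalization described above can be sketched in base R alone. The toy 2 x 3 embedding matrix below is made up for illustration; in practice the rows would come from the embedding dictionary of a trained model (as.matrix on the model).

```r
# Hypothetical word embeddings: one row per feature (word), one column per dimension
emb <- rbind(federale = c(0.1, 0.2, 0.3),
             politie  = c(0.4, 0.5, 0.6))
size <- nrow(emb)        # number of features (words) in x

# similarity = "dot": divide the summed embedding by size^p
p <- 0.5
doc_dot <- colSums(emb) / size^p

# similarity = "cosine": divide the summed embedding by its Euclidean norm
sums <- colSums(emb)
doc_cosine <- sums / sqrt(sum(sums^2))
```

With p = 0.5 and two words, doc_dot matches the colSums(...) / 2^0.5 computation shown in the Examples section below.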
Value
a matrix of embeddings, with one row for each element of x
Examples
data(dekamer, package = "ruimtehol")
dekamer$text <- strsplit(dekamer$question, "\\W")
dekamer$text <- lapply(dekamer$text, FUN = function(x) x[x != ""])
dekamer$text <- sapply(dekamer$text,
FUN = function(x) paste(x, collapse = " "))
set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text),
y = dekamer$question_theme_main,
similarity = "dot",
early_stopping = 0.8, ngram = 1, p = 0.5,
dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
embedding
colSums(embedding_dictionary[c("federale", "politie"), ]) / 2^0.5
## Not run:
set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text),
y = dekamer$question_theme_main,
similarity = "cosine",
early_stopping = 0.8, ngram = 1,
dim = 10, minCount = 5)
embedding <- starspace_embedding(model, "federale politie", type = "document")
embedding_dictionary <- as.matrix(model)
euclidean_norm <- function(x) sqrt(sum(x^2))
manual <- colSums(embedding_dictionary[c("federale", "politie"), ])
manual / euclidean_norm(manual)
embedding
set.seed(123456789)
model <- embed_tagspace(x = tolower(dekamer$text),
y = dekamer$question_theme_main,
similarity = "dot",
early_stopping = 0.8, ngram = 3, p = 0,
dim = 10, minCount = 5, bucket = 1)
starspace_embedding(model, "federale politie", type = "document")
starspace_embedding(model, "federale politie", type = "ngram")
## End(Not run)