text_to_vec {PsychWordVec} | R Documentation
Extract contextualized word embeddings from transformers (pre-trained language models).
Description
Extract hidden layers from a language model and aggregate them to get
token (roughly word) embeddings and text embeddings (all reshaped to
embed matrix). It is a wrapper function of text::textEmbed().
Usage
text_to_vec(
text,
model,
layers = "all",
layer.to.token = "concatenate",
token.to.word = TRUE,
token.to.text = TRUE,
encoding = "UTF-8",
...
)
Arguments
text

    Text to be analyzed. Can be a character vector of texts or a
    data frame with at least one character variable.

model

    Model name at HuggingFace (https://huggingface.co/models).

layers

    Layers to be extracted from the model and then aggregated.
    Defaults to "all".

layer.to.token

    Method to aggregate hidden layers to each token.
    Defaults to "concatenate".

token.to.word

    Aggregate subword token embeddings (if the whole word is out of
    vocabulary) to whole word embeddings. Defaults to TRUE.

token.to.text

    Aggregate token embeddings to each text. Defaults to TRUE.

encoding

    Text encoding. Defaults to "UTF-8".

...

    Other parameters passed to text::textEmbed().
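To make the layer.to.token options concrete, here is a toy sketch (not
PsychWordVec internals; the vectors and dimension are invented for
illustration) of how two hidden layers for a single token could be
aggregated under "concatenate" versus "mean":

```r
# Toy sketch (assumed values, not actual model output): two hidden
# layers for one token, each a hypothetical 3-dimensional vector.
layer0  <- c(0.1, 0.2, 0.3)  # hypothetical hidden state from layer 0
layer12 <- c(0.5, 0.6, 0.7)  # hypothetical hidden state from layer 12

# "concatenate": link the layers into one long vector,
# so the token embedding has n.layers * hidden.size dimensions.
concat <- c(layer0, layer12)

# "mean": average the layers element-wise,
# so the token embedding keeps the hidden size.
avg <- (layer0 + layer12) / 2

length(concat)  # 6
avg             # 0.3 0.4 0.5
```

With "concatenate", extracting more layers makes the embedding longer;
with "mean", the dimensionality stays fixed regardless of how many
layers are requested.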
Value

A list of:

token.embed

    Token (roughly word) embeddings.

text.embed

    Text embeddings, aggregated from token embeddings.
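A conceptual sketch of this aggregation (assumed toy values and a simple
column mean; not the package's actual aggregation code): token
embeddings for one text form a tokens-by-dimensions matrix, and the text
embedding summarizes it into a single vector.

```r
# Toy sketch (invented numbers): 3 tokens x 2 dimensions.
token.embed <- matrix(1:6, nrow = 3, ncol = 2,
                      dimnames = list(c("Beijing", "is", "capital"), NULL))

# One way to aggregate tokens to a text embedding: column means.
text.embed <- colMeans(token.embed)
text.embed  # 2 5
```

The returned list mirrors this shape: embed$token.embed holds one
matrix per input text, and embed$text.embed holds one row per text.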
See Also

text_init
Examples
## Not run:
# text_init() # initialize the environment
text = c("Download models from HuggingFace",
"Chinese are East Asian",
"Beijing is the capital of China")
embed = text_to_vec(text, model="bert-base-cased", layers=c(0, 12))
embed
embed1 = embed$token.embed[[1]]
embed2 = embed$token.embed[[2]]
embed3 = embed$token.embed[[3]]
View(embed1)
View(embed2)
View(embed3)
View(embed$text.embed)
plot_similarity(embed1, value.color="grey")
plot_similarity(embed2, value.color="grey")
plot_similarity(embed3, value.color="grey")
plot_similarity(rbind(embed1, embed2, embed3))
## End(Not run)
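The plots above visualize pairwise similarities between embedding rows.
Conceptually (a sketch of the standard quantity such plots show, not
plot_similarity() internals), the similarity between two embedding
vectors is their cosine similarity:

```r
# Cosine similarity between two embedding vectors (toy values).
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

v1 <- c(1, 0, 1)
v2 <- c(1, 1, 0)
cosine(v1, v1)  # 1 (identical vectors)
cosine(v1, v2)  # 0.5
```

Values near 1 indicate semantically similar tokens or texts; values
near 0 indicate unrelated ones.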