fasttext {wordsalad} | R Documentation |
Extract word vectors from a fastText word embedding
Description
The calculations are done with the fastTextR package.
Usage
fasttext(
text,
tokenizer = text2vec::space_tokenizer,
dim = 10L,
type = c("skip-gram", "cbow"),
window = 5L,
loss = "hs",
negative = 5L,
n_iter = 5L,
min_count = 5L,
threads = 1L,
composition = c("tibble", "data.frame", "matrix"),
verbose = FALSE
)
Arguments
text |
Character string. |
tokenizer |
Function, function to perform tokenization. Defaults to text2vec::space_tokenizer. |
dim |
Integer, number of dimensions of the resulting word vectors. Defaults to 10. |
type |
Character, the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'skip-gram'. |
window |
Integer, skip length between words. Defaults to 5. |
loss |
Character, choice of loss function, must be one of "ns", "hs", or "softmax". See Details for more information. Defaults to "hs". |
negative |
Integer, number of negative samples. Only used when loss = "ns". |
n_iter |
Integer, number of training iterations. Defaults to 5. |
min_count |
Integer, number of times a token should appear to be considered in the model. Defaults to 5. |
threads |
Integer, number of CPU threads to use. Defaults to 1. |
composition |
Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors. |
verbose |
Logical, controls whether progress is reported as operations are executed. |
Details
The choice of loss function is one of:
"ns" negative sampling
"hs" hierarchical softmax
"softmax" full softmax
Value
A tibble, data.frame or matrix containing the token in the first column and word vectors in the remaining columns.
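The composition argument selects among these return formats; a sketch, again assuming the fairy_tales example data:

```r
library(wordsalad)

# Request a base data.frame instead of the default tibble; the tokens
# occupy the first column and the `dim` embedding values the rest.
df <- fasttext(fairy_tales, n_iter = 2, composition = "data.frame")
head(df)
```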
References
Enriching Word Vectors with Subword Information, 2016, P. Bojanowski, E. Grave, A. Joulin, T. Mikolov.
Examples
fasttext(fairy_tales, n_iter = 2)
# Custom tokenizer that splits on non-alphanumeric characters
fasttext(fairy_tales,
  n_iter = 2,
  tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))