word2vec {wordsalad} | R Documentation |
Extract word vectors from word2vec word embedding
Description
The calculations are done with the word2vec package.
Usage
word2vec(
text,
tokenizer = text2vec::space_tokenizer,
dim = 50,
type = c("cbow", "skip-gram"),
window = 5L,
min_count = 5L,
loss = c("ns", "hs"),
negative = 5L,
n_iter = 5L,
lr = 0.05,
sample = 0.001,
stopwords = character(),
threads = 1L,
collapse_character = "\t",
composition = c("tibble", "data.frame", "matrix")
)
Arguments
text |
Character string. |
tokenizer |
Function, function to perform tokenization. Defaults to text2vec::space_tokenizer. |
dim |
dimension of the word vectors. Defaults to 50. |
type |
the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'. |
window |
skip length between words. Defaults to 5. |
min_count |
integer indicating the number of times a word must occur to be considered part of the training vocabulary. Defaults to 5. |
loss |
Character, choice of loss function; must be one of "ns" or "hs". See details for more information. Defaults to "ns". |
negative |
integer with the number of negative samples. Only used when loss is "ns". |
n_iter |
Integer, number of training iterations. Defaults to 5. |
lr |
initial learning rate, also known as alpha. Defaults to 0.05. |
sample |
threshold for occurrence of words. Defaults to 0.001. |
stopwords |
a character vector of stopwords to exclude from training. |
threads |
number of CPU threads to use. Defaults to 1. |
collapse_character |
Character vector with length 1. Character used to glue together tokens after tokenizing. See details for more information. Defaults to "\t". |
composition |
Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors. |
Details
A trade-off has been made to allow for an arbitrary tokenizing function. The
text is first passed through the tokenizer and then collapsed back together
into strings using collapse_character as the separator. You need to pick a
collapse_character that will not appear in any of the tokens after tokenizing
is done. The default value is a "tab" character. If you pick a character that
is present in the tokens, then those tokens will be split.
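For instance, if the tokens themselves could contain tab characters, a safer separator can be supplied explicitly. A minimal sketch, reusing the fairy_tales data from the examples below (the choice of separator character here is only illustrative):

```r
# The default collapse_character is "\t". If a token could contain a tab,
# pick a character guaranteed to be absent from the tokens, e.g. the
# rarely used "unit separator" control character.
word2vec(fairy_tales, collapse_character = "\u001f")
```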
The choice of loss function is one of:
"ns" negative sampling
"hs" hierarchical softmax
Value
A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
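A sketch of selecting the output format via the composition argument, again using the fairy_tales data from the examples; the dim value is arbitrary:

```r
# Return the word vectors as a data.frame instead of the default tibble.
# The first column holds the tokens; the remaining 25 columns hold the
# vector dimensions.
emb <- word2vec(fairy_tales, dim = 25, composition = "data.frame")
```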
References
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff. 2013. Distributed Representations of Words and Phrases and their Compositionality
Examples
word2vec(fairy_tales)
# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))