word2vec {wordsalad} | R Documentation |
Extract word vectors from word2vec word embedding
Description
The calculations are done with the word2vec package.
Usage
word2vec(
text,
tokenizer = text2vec::space_tokenizer,
dim = 50,
type = c("cbow", "skip-gram"),
window = 5L,
min_count = 5L,
loss = c("ns", "hs"),
negative = 5L,
n_iter = 5L,
lr = 0.05,
sample = 0.001,
stopwords = character(),
threads = 1L,
collapse_character = "\t",
composition = c("tibble", "data.frame", "matrix")
)
Arguments
text |
Character string. |
tokenizer |
Function, function to perform tokenization. Defaults to text2vec::space_tokenizer. |
dim |
dimension of the word vectors. Defaults to 50. |
type |
the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'. |
window |
skip length between words. Defaults to 5. |
min_count |
integer indicating the number of times a word must occur to be considered part of the training vocabulary. Defaults to 5. |
loss |
Character, choice of loss function; must be one of "ns" or "hs". See details for more information. Defaults to "ns". |
negative |
integer with the number of negative samples. Only used when loss is "ns". |
n_iter |
Integer, number of training iterations. Defaults to 5. |
lr |
initial learning rate, also known as alpha. Defaults to 0.05. |
sample |
threshold for occurrence of words. Defaults to 0.001. |
stopwords |
a character vector of stopwords to exclude from training. |
threads |
number of CPU threads to use. Defaults to 1. |
collapse_character |
Character vector with length 1. Character used to glue together tokens after tokenizing. See details for more information. Defaults to "\t". |
composition |
Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors. |
Details
A trade-off has been made to allow for an arbitrary tokenizing function. The
text is first passed through the tokenizer and then collapsed back together
into strings using collapse_character as the separator. You need to pick a
collapse_character that will not appear in any of the tokens after tokenizing
is done. The default value is a "tab" character. If you pick a character that
is present in the tokens, then those tokens will be split.
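For instance, if the tokens themselves could contain tab characters, a safer separator can be supplied explicitly. A minimal sketch, reusing the fairy_tales data from the examples below (the choice of separator character here is only illustrative):

```r
# The default collapse_character is "\t". If a token could contain a tab,
# pick a character guaranteed to be absent from the tokens, e.g. the
# rarely used "unit separator" control character.
word2vec(fairy_tales, collapse_character = "\u001f")
```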
The choice of loss function is one of:
"ns" negative sampling
"hs" hierarchical softmax
Value
A tibble, data.frame, or matrix containing the tokens in the first column and the word vectors in the remaining columns.
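A sketch of selecting the output format via the composition argument, again using the fairy_tales data from the examples; the dim value is arbitrary:

```r
# Return the word vectors as a data.frame instead of the default tibble.
# The first column holds the tokens; the remaining 25 columns hold the
# vector dimensions.
emb <- word2vec(fairy_tales, dim = 25, composition = "data.frame")
```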
References
Mikolov, Tomas and Sutskever, Ilya and Chen, Kai and Corrado, Greg S and Dean, Jeff. 2013. Distributed Representations of Words and Phrases and their Compositionality
Examples
word2vec(fairy_tales)
# Custom tokenizer that splits on non-alphanumeric characters
word2vec(fairy_tales, tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))