fasttext {wordsalad} | R Documentation |
Extract word vectors from a fastText word embedding
Description
The calculations are done with the fastTextR package.
Usage
fasttext(
text,
tokenizer = text2vec::space_tokenizer,
dim = 10L,
type = c("skip-gram", "cbow"),
window = 5L,
loss = "hs",
negative = 5L,
n_iter = 5L,
min_count = 5L,
threads = 1L,
composition = c("tibble", "data.frame", "matrix"),
verbose = FALSE
)
Arguments
text |
Character string. |
tokenizer |
Function, function to perform tokenization. Defaults to text2vec::space_tokenizer. |
dim |
Integer, number of dimensions of the resulting word vectors. Defaults to 10. |
type |
Character, the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'skip-gram'. |
window |
Integer, skip length between words. Defaults to 5. |
loss |
Character, choice of loss function, must be one of "ns", "hs", or "softmax". See Details for more information. Defaults to "hs". |
negative |
Integer, number of negative samples. Only used when loss = "ns". |
n_iter |
Integer, number of training iterations. Defaults to 5. |
min_count |
Integer, number of times a token should appear to be considered in the model. Defaults to 5. |
threads |
Integer, number of CPU threads to use. Defaults to 1. |
composition |
Character, either "tibble", "matrix", or "data.frame" for the format of the resulting word vectors. |
verbose |
Logical, controls whether progress is reported as operations are executed. |
Details
The choice of loss function is one of:
"ns" negative sampling
"hs" hierarchical softmax
"softmax" full softmax
Value
A tibble, data.frame or matrix containing the token in the first column and word vectors in the remaining columns.
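The composition argument selects among these return formats; a sketch, again assuming the fairy_tales example data:

```r
library(wordsalad)

# Request a base data.frame instead of the default tibble; the tokens
# occupy the first column and the `dim` embedding values the rest.
df <- fasttext(fairy_tales, n_iter = 2, composition = "data.frame")
head(df)
```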
References
Enriching Word Vectors with Subword Information, 2016, P. Bojanowski, E. Grave, A. Joulin, T. Mikolov.
Examples
fasttext(fairy_tales, n_iter = 2)
# Custom tokenizer that splits on non-alphanumeric characters
fasttext(fairy_tales,
  n_iter = 2,
  tokenizer = function(x) strsplit(x, "[^[:alnum:]]+"))