train_wordvec {PsychWordVec}          R Documentation

Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.

Description

Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm with multi-threading.

Usage

train_wordvec(
  text,
  method = c("word2vec", "glove", "fasttext"),
  dims = 300,
  window = 5,
  min.freq = 5,
  threads = 8,
  model = c("skip-gram", "cbow"),
  loss = c("ns", "hs"),
  negative = 5,
  subsample = 1e-04,
  learning = 0.05,
  ngrams = c(3, 6),
  x.max = 10,
  convergence = -1,
  stopwords = character(0),
  encoding = "UTF-8",
  tolower = FALSE,
  normalize = FALSE,
  iteration,
  tokenizer,
  remove,
  file.save,
  compress = "bzip2",
  verbose = TRUE
)

Arguments

text

A character vector of text, or a file path on disk containing text.
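
Both input forms are accepted. A minimal sketch ("corpus.txt" is a hypothetical file path):

text = text2vec::movie_review$review  # a character vector of raw text
dt = train_wordvec(text, dims = 50)   # or: train_wordvec("corpus.txt", dims = 50)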

method

Training algorithm:

  • "word2vec" (default): Word2Vec

  • "glove": GloVe

  • "fasttext": FastText

dims

Number of dimensions of word vectors to be trained. Common choices include 50, 100, 200, 300, and 500. Defaults to 300.

window

Window size (number of context words behind/ahead of the current word). It defines how many surrounding words are included in training: [window] words behind and [window] words ahead ([window]*2 in total). Defaults to 5.
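
For illustration, a symmetric window of size 2 around "fox" (a sketch of the idea, not package internals):

tokens = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog")
i = which(tokens == "fox")
tokens[setdiff(max(1, i - 2):min(length(tokens), i + 2), i)]
# "quick" "brown" "jumps" "over": window * 2 = 4 context words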

min.freq

Minimum word frequency for inclusion in training. Words that appear fewer than this number of times are excluded from the vocabulary. Defaults to 5 (keep words that appear at least five times).
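
For example, pruning a toy vocabulary at the default threshold (a sketch of the idea):

freq = c(movie = 120, good = 30, plot = 7, zeitgeist = 2)
names(freq)[freq >= 5]  # "movie" "good" "plot"; "zeitgeist" (freq 2) is dropped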

threads

Number of CPU threads used for training. A modest value usually produces the fastest training; adding too many threads is not always helpful. Defaults to 8.

model

<Only for Word2Vec / FastText>

Learning model architecture:

  • "skip-gram" (default): Skip-Gram, which predicts surrounding words given the current word

  • "cbow": Continuous Bag-of-Words, which predicts the current word based on the context

loss

<Only for Word2Vec / FastText>

Loss function (computationally efficient approximation):

  • "ns" (default): Negative Sampling

  • "hs": Hierarchical Softmax

negative

<Only for Negative Sampling in Word2Vec / FastText>

Number of negative examples. Values in the range of 5-20 are useful for small training datasets, while for large datasets the value can be as small as 2-5. Defaults to 5.
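
As a rough illustration, negative examples are drawn from a noise distribution; the sketch below assumes the classic Word2Vec choice of the unigram distribution raised to the 3/4 power, not necessarily this package's exact internals:

freq = c(the = 100, movie = 40, good = 25, plot = 10, actor = 5)
p = freq^0.75 / sum(freq^0.75)                    # noise distribution
sample(names(freq), 5, replace = TRUE, prob = p)  # draw 5 negative examples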

subsample

<Only for Word2Vec / FastText>

Subsampling threshold for frequent words. Words that appear with a relative frequency higher than this threshold are randomly down-sampled during training. Defaults to 0.0001 (1e-04).
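
A sketch of the down-sampling rule, assuming the classic Word2Vec formula (keep probability of about sqrt(t/f) for a word with relative frequency f), not necessarily this package's exact internals:

thr = 1e-04                                    # the subsample threshold
f = c(the = 0.05, movie = 0.01, plot = 0.001)  # relative frequencies
pmin(1, sqrt(thr / f))  # keep probabilities: ~0.045, 0.100, ~0.316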

learning

<Only for Word2Vec / FastText>

Initial (starting) learning rate, also known as alpha. Defaults to 0.05.

ngrams

<Only for FastText>

Minimum and maximum length of character n-grams. Defaults to c(3, 6).

x.max

<Only for GloVe>

Maximum number of co-occurrences to use in the weighting function. Defaults to 10.
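
The weighting function caps the influence of very frequent co-occurrences. A sketch following the original GloVe paper (with alpha = 0.75):

x.max = 10
f = function(x, alpha = 0.75) ifelse(x < x.max, (x / x.max)^alpha, 1)
f(c(1, 5, 10, 50))  # ~0.178 ~0.595 1.000 1.000: counts >= x.max get full weight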

convergence

<Only for GloVe>

Convergence tolerance for SGD iterations: training stops early once the relative improvement in the cost falls below this value. Defaults to -1, which disables early stopping (all iterations are performed).

stopwords

<Only for Word2Vec / GloVe>

A character vector of stopwords to be excluded from training.

encoding

Text encoding. Defaults to "UTF-8".

tolower

Convert all upper-case characters to lower-case? Defaults to FALSE.

normalize

Normalize all word vectors to unit length? Defaults to FALSE. See normalize.
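
Unit-length (L2) normalization divides each vector by its Euclidean norm, e.g.:

v = c(0.3, -0.4, 1.2)
v.unit = v / sqrt(sum(v^2))
sum(v.unit^2)  # 1: the normalized vector has unit length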

iteration

Number of training iterations. More iterations make a more precise model, but the computational cost grows linearly with the number of iterations. Defaults to 5 for Word2Vec and FastText, and 10 for GloVe.

tokenizer

Function used to tokenize the text. Defaults to text2vec::word_tokenizer.
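
A custom tokenizer can be any function mapping a character vector to a list of token vectors. A hypothetical sketch (my.tokenizer is not part of the package; text as in the Examples):

my.tokenizer = function(x) strsplit(tolower(x), "[[:space:][:punct:]]+")
dt = train_wordvec(text, dims = 50, tokenizer = my.tokenizer)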

remove

Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You may turn this off by specifying remove=NULL.
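
Conceptually, this is a plain regular-expression substitution applied before tokenization:

gsub("_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.", "", "I've seen it e.g. twice<br />")
# "Ive seen it  twice": apostrophes are stripped, so "I've" becomes "Ive"
# (which is why the Examples below query the word "Ive")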

file.save

File name for saving the trained word vectors as R data (must end with .RData). See the sketch after the compression options below.

compress

Compression method for the saved file. Defaults to "bzip2".

Options include:

  • 1 or "gzip": modest file size (fastest)

  • 2 or "bzip2": small file size (fast)

  • 3 or "xz": minimized file size (slow)
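
A sketch of saving and reloading a trained model ("wordvec.RData" is a hypothetical file name; text as in the Examples):

dt = train_wordvec(text, dims = 50, file.save = "wordvec.RData", compress = "xz")
load("wordvec.RData")  # restores the saved object in a later session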

verbose

Print information to the console? Defaults to TRUE.

Value

A wordvec (data.table) with three variables: word, vec, freq.
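
Assuming dt is a trained result, its shape can be inspected as below (a sketch; vec is expected to hold one numeric vector of length dims per word):

dt$word[1:5]   # character: the words
dt$vec[[1]]    # numeric vector of length dims (if vec is a list column)
dt$freq[1:5]   # word frequencies in the training corpus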

Download

Download pre-trained word vector data (.RData): https://psychbruce.github.io/WordVector_RData.pdf

See Also

tokenize

Examples

review = text2vec::movie_review  # a data.frame
text = review$review

## Note: All the examples train 50 dims for a faster code check.

## Word2Vec (SGNS)
dt1 = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)

dt1
most_similar(dt1, "Ive")  # evaluate performance
most_similar(dt1, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt1, ~ boy - he + she, topn=5)  # evaluate performance

## GloVe
dt2 = train_wordvec(
  text,
  method="glove",
  dims=50, window=5,
  normalize=TRUE)

dt2
most_similar(dt2, "Ive")  # evaluate performance
most_similar(dt2, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt2, ~ boy - he + she, topn=5)  # evaluate performance

## FastText
dt3 = train_wordvec(
  text,
  method="fasttext",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE)

dt3
most_similar(dt3, "Ive")  # evaluate performance
most_similar(dt3, ~ man - he + she, topn=5)  # evaluate performance
most_similar(dt3, ~ boy - he + she, topn=5)  # evaluate performance

