train_wordvec {PsychWordVec}    R Documentation
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm.
Description
Train static word embeddings using the Word2Vec, GloVe, or FastText algorithm with multi-threading.
Usage
train_wordvec(
text,
method = c("word2vec", "glove", "fasttext"),
dims = 300,
window = 5,
min.freq = 5,
threads = 8,
model = c("skip-gram", "cbow"),
loss = c("ns", "hs"),
negative = 5,
subsample = 1e-04,
learning = 0.05,
ngrams = c(3, 6),
x.max = 10,
convergence = -1,
stopwords = character(0),
encoding = "UTF-8",
tolower = FALSE,
normalize = FALSE,
iteration,
tokenizer,
remove,
file.save,
compress = "bzip2",
verbose = TRUE
)
Arguments
text: A character vector of text, or a file path on disk containing text.
method: Training algorithm: "word2vec" (default), "glove", or "fasttext".
dims: Number of dimensions of the word vectors to be trained.
Common choices include 50, 100, 200, 300, and 500.
Defaults to 300.
window: Window size (number of nearby words behind/ahead of the current word).
It defines how many surrounding words are included in training:
[window] words behind and [window] words ahead ([window]*2 in total).
Defaults to 5 (illustrated in the sketch after this argument list).
min.freq: Minimum frequency of words to be included in training.
Words that appear fewer than this number of times will be excluded from the vocabulary.
Defaults to 5.
threads: Number of CPU threads used for training.
A modest value produces the fastest training; too many threads are not always helpful.
Defaults to 8.
model: <Only for Word2Vec / FastText> Learning model architecture:
"skip-gram" (default) or "cbow" (continuous bag of words).
loss: <Only for Word2Vec / FastText> Loss function (computationally efficient approximation):
"ns" (negative sampling, default) or "hs" (hierarchical softmax).
negative: <Only for Negative Sampling in Word2Vec / FastText> Number of negative examples.
Values in the range 5~20 are useful for small training datasets,
while for large datasets the value can be as small as 2~5.
Defaults to 5.
subsample: <Only for Word2Vec / FastText> Subsampling of frequent words (threshold for occurrence of words).
Words that appear with higher frequency in the training data will be randomly down-sampled.
Defaults to 1e-04.
learning: <Only for Word2Vec / FastText> Initial (starting) learning rate, also known as alpha.
Defaults to 0.05.
ngrams: <Only for FastText> Minimal and maximal ngram length.
Defaults to c(3, 6).
x.max: <Only for GloVe> Maximum number of co-occurrences used in the weighting function.
Defaults to 10.
convergence: <Only for GloVe> Convergence tolerance for SGD iterations.
Defaults to -1.
stopwords: <Only for Word2Vec / GloVe> A character vector of stopwords to be excluded from training.
Defaults to character(0).
encoding: Text encoding. Defaults to "UTF-8".
tolower: Convert all upper-case characters to lower-case?
Defaults to FALSE.
normalize: Normalize all word vectors to unit length?
Defaults to FALSE (see the normalization sketch after this argument list).
iteration: Number of training iterations.
More iterations make a more precise model,
but the computational cost is linearly proportional to the number of iterations.
Defaults to
tokenizer: Function used to tokenize the text.
Defaults to
remove: Strings (as regular expressions) to be removed from the text.
Defaults to
file.save: File name of the to-be-saved R data file (must be .RData).
compress: Compression method for the saved file.
Defaults to "bzip2". Options include "gzip", "bzip2", and "xz" (see R's save()).
verbose: Print information to the console? Defaults to TRUE.
Value
A wordvec (data.table) with three variables: word, vec, freq.
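As a minimal sketch of working with the returned object (using dt1 as trained in the Examples below, and assuming vec is a list column holding one numeric vector per word):

dt1                      # a data.table with columns word, vec, freq
dt1$word[1:5]            # first five words in the vocabulary
length(dt1$vec[[1]])     # length of each word vector (should equal dims)
dt1$freq[1:5]            # raw corpus frequency of the first five words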
Download
Download pre-trained word vectors data (.RData):
https://psychbruce.github.io/WordVector_RData.pdf
See Also
Examples
review = text2vec::movie_review  # a data.frame
text = review$review
## Note: All examples train 50-dim vectors for a faster code check.
## Word2Vec (SGNS)
dt1 = train_wordvec(
text,
method="word2vec",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
dt1
most_similar(dt1, "Ive") # evaluate performance
most_similar(dt1, ~ man - he + she, topn=5) # evaluate performance
most_similar(dt1, ~ boy - he + she, topn=5) # evaluate performance
## GloVe
dt2 = train_wordvec(
text,
method="glove",
dims=50, window=5,
normalize=TRUE)
dt2
most_similar(dt2, "Ive") # evaluate performance
most_similar(dt2, ~ man - he + she, topn=5) # evaluate performance
most_similar(dt2, ~ boy - he + she, topn=5) # evaluate performance
## FastText
dt3 = train_wordvec(
text,
method="fasttext",
model="skip-gram",
dims=50, window=5,
normalize=TRUE)
dt3
most_similar(dt3, "Ive") # evaluate performance
most_similar(dt3, ~ man - he + she, topn=5) # evaluate performance
most_similar(dt3, ~ boy - he + she, topn=5) # evaluate performance
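## A further hedged sketch of the file.save argument: the file name below is
## hypothetical, and the sketch assumes the saved .RData contains the returned
## wordvec object so that it can be restored with load().
dt4 = train_wordvec(
  text,
  method="word2vec",
  model="skip-gram",
  dims=50, window=5,
  normalize=TRUE,
  file.save="movie_review_wordvec.RData",  # hypothetical file name
  compress="bzip2")

## In a later session: restore the saved object
load("movie_review_wordvec.RData")
ls()  # check the name of the restored wordvec object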