h2o.word2vec {h2o} | R Documentation |
Trains a word2vec model on a String column of an H2O data frame
Description
Trains a word2vec model on a String column of an H2O data frame.
Usage
h2o.word2vec(
training_frame = NULL,
model_id = NULL,
min_word_freq = 5,
word_model = c("SkipGram", "CBOW"),
norm_model = c("HSM"),
vec_size = 100,
window_size = 5,
sent_sample_rate = 0.001,
init_learning_rate = 0.025,
epochs = 5,
pre_trained = NULL,
max_runtime_secs = 0,
export_checkpoints_dir = NULL
)
Arguments
training_frame |
Id of the training data frame. |
model_id |
Destination id for this model; auto-generated if not specified. |
min_word_freq |
Discard words that appear fewer than this many times in the training data. Defaults to 5. |
word_model |
The word model to use. Must be one of: "SkipGram", "CBOW". Defaults to SkipGram. |
norm_model |
The normalization model to use (Hierarchical Softmax). Must be one of: "HSM". Defaults to HSM. |
vec_size |
Size (number of dimensions) of the word vectors. Defaults to 100. |
window_size |
Maximum skip length between words. Defaults to 5. |
sent_sample_rate |
Threshold for word occurrence. Words that appear with higher frequency in the training data are randomly down-sampled; useful range is (0, 1e-5). Defaults to 0.001. |
init_learning_rate |
The starting learning rate. Defaults to 0.025. |
epochs |
Number of training iterations to run. Defaults to 5. |
pre_trained |
Id of a data frame that contains a pre-trained (external) word2vec model. |
max_runtime_secs |
Maximum allowed runtime in seconds for model training. Use 0 to disable. Defaults to 0. |
export_checkpoints_dir |
Automatically export generated models to this directory. |
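To make the effect of min_word_freq concrete, the following base-R sketch mimics frequency-based vocabulary pruning on a plain character vector. It is an illustration only: prune_vocab is a hypothetical helper that is not part of the h2o package, and it assumes that a word appearing exactly min_word_freq times is kept, which follows from the wording "fewer than" above but is not verified against the H2O implementation.

```r
# Illustration only: base-R sketch of frequency-based vocabulary pruning.
# `prune_vocab` is a hypothetical helper, not an h2o function.
prune_vocab <- function(tokens, min_word_freq = 5) {
  counts <- table(tokens)                 # occurrences of each distinct word
  # Assumption: words with count >= min_word_freq stay in the vocabulary.
  names(counts)[as.vector(counts) >= min_word_freq]
}

# Toy token stream: "teacher" x6, "nurse" x5, "driver" x2
tokens <- c(rep("teacher", 6), rep("nurse", 5), rep("driver", 2))
kept <- prune_vocab(tokens, min_word_freq = 5)
print(kept)
```

With the toy input, "driver" falls below the threshold and is dropped. On small corpora, lowering min_word_freq may be needed to keep enough words in the vocabulary.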
Examples
## Not run:
library(h2o)
h2o.init()
# Import the CraigslistJobTitles dataset
job_titles <- h2o.importFile(
  "https://s3.amazonaws.com/h2o-public-test-data/smalldata/craigslistJobTitles.csv",
  col.names = c("category", "jobtitle"),
  col.types = c("String", "String"),
  header = TRUE
)
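# Tokenize the text on single spaces
words <- h2o.tokenize(job_titles, " ")
# Build and train the Word2Vec model with default settings
vec <- h2o.word2vec(training_frame = words)
# Look up the 20 words closest to "teacher" in the embedding space
h2o.findSynonyms(vec, "teacher", count = 20)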
## End(Not run)
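The example above uses the defaults. As a hedged sketch of how the non-default arguments might be combined, the helper below trains a CBOW model with a smaller vector size and then averages the word vectors per row of text using h2o.transform. The function name train_cbow_w2v and its default values are hypothetical and chosen only for illustration; the calls assume a running H2O cluster and a single String column as input.

```r
# Hypothetical helper; the name and default values are illustrative only.
# Calls are namespaced with h2o:: so that defining the function does not
# require the package to be attached.
train_cbow_w2v <- function(text_column, vec_size = 50, epochs = 10) {
  # Split each row into words on runs of whitespace
  words <- h2o::h2o.tokenize(text_column, "\\s+")
  # Train with the CBOW word model instead of the default SkipGram
  model <- h2o::h2o.word2vec(
    training_frame = words,
    word_model = "CBOW",
    vec_size = vec_size,
    epochs = epochs
  )
  # Aggregate word vectors into one averaged vector per tokenized row
  doc_vecs <- h2o::h2o.transform(model, words, aggregate_method = "AVERAGE")
  list(model = model, doc_vecs = doc_vecs)
}

# Usage (requires h2o.init() and the job_titles frame from the example):
# res <- train_cbow_w2v(job_titles$jobtitle)
# h2o.findSynonyms(res$model, "teacher", count = 5)
```

Returning both the model and the averaged vectors keeps the sketch usable for synonym lookup as well as for downstream modeling on the aggregated features.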