starspace {ruimtehol}    R Documentation

Interface to Starspace for training a Starspace model

Description

Interface to Starspace for training a Starspace model, providing raw access to the C++ functionality.

Usage

starspace(
  model = "textspace.bin",
  file,
  trainMode = 0,
  fileFormat = c("fastText", "labelDoc"),
  label = "__label__",
  dim = 100,
  epoch = 5,
  lr = 0.01,
  loss = c("hinge", "softmax"),
  margin = 0.05,
  similarity = c("cosine", "dot"),
  negSearchLimit = 50,
  adagrad = TRUE,
  ws = 5,
  minCount = 1,
  minCountLabel = 1,
  ngrams = 1,
  thread = 1,
  ...
)

Arguments

model

the full path to where the model file will be saved. Defaults to 'textspace.bin'.

file

the full path to the file on disk which will be used for training.

trainMode

integer with the training mode. Possible values are 0, 1, 2, 3, 4 or 5. Defaults to 0. The use cases are:

  • 0: tagspace (classification tasks) and search tasks

  • 1: pagespace & docspace (interest-based or content-based recommendation)

  • 2: articlespace (sentences within document)

  • 3: sentence embeddings and entity similarity

  • 4: multi-relational graphs

  • 5: word embeddings

fileFormat

one of 'fastText' or 'labelDoc'. See the StarSpace documentation for details.

label

labels prefix (character string identifying how a label is prefixed, defaults to '__label__')

dim

the size of the embedding vectors (integer, defaults to 100)

epoch

number of epochs (integer, defaults to 5)

lr

learning rate (numeric, defaults to 0.01)

loss

loss function (either 'hinge' or 'softmax')

margin

margin parameter in case of hinge loss (numeric, defaults to 0.05)

similarity

cosine or dot product similarity in case of hinge loss (character, defaults to 'cosine')

negSearchLimit

number of negatives sampled (integer, defaults to 50)

adagrad

whether to use adagrad in training (logical)

ws

the size of the context window for word level training - only used in trainMode 5 (integer, defaults to 5)

minCount

minimum number of word occurrences for a word to be part of the dictionary (integer, defaults to 1, keeping all words)

minCountLabel

minimum number of label occurrences for a label to be part of the dictionary (integer, defaults to 1, keeping all labels)

ngrams

max length of word ngram (integer, defaults to 1, using only unigrams)

thread

integer with the number of threads to use. Defaults to 1.

...

arguments passed on to ruimtehol:::textspace. See the details below.
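
As a rough sketch of how file, label and trainMode fit together: with fileFormat 'fastText', each line of the training file holds the words of one document followed by its labels, and every label carries the prefix given in label. The helper name to_fasttext_lines and the toy documents below are assumptions made for illustration only; they are not part of ruimtehol.

```r
# Hypothetical helper (not part of ruimtehol): build one fastText-style
# line per document, appending each label with the prefix that is later
# passed to starspace() through the `label` argument.
to_fasttext_lines <- function(text, labels, prefix = "__label__") {
  stopifnot(length(text) == length(labels))
  mapply(function(txt, lab) {
    # a label has to be a single token, so whitespace becomes "_"
    lab <- gsub("[[:space:]]+", "_", lab)
    paste(c(txt, paste0(prefix, lab)), collapse = " ")
  }, text, labels, USE.NAMES = FALSE)
}

# Toy data, invented for this sketch
docs <- c("the pension age will rise", "lottery revenue fell last year")
tags <- list(c("pensions", "social affairs"), "finance")

lines <- to_fasttext_lines(docs, tags)
print(lines)

# Write the lines to disk so the path can be used as the `file` argument
trainfile <- tempfile(fileext = ".txt")
writeLines(lines, con = trainfile)
```

The resulting path can then be passed as file, together with trainMode = 0 and the same label prefix.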

Value

an object of class textspace which is a list with elements

Note

The function starspace is a thin wrapper around the internal function ruimtehol:::textspace, which gives direct access to the C++ code that runs Starspace.
The following arguments are available in that function during training. Default settings are shown next to each definition. Some of these arguments are set directly in the starspace function; the others can be passed on through the ... argument.

Arguments which define how the training is done:

Arguments specific to the dictionary of words and labels:

Arguments which define early stopping or proceeding of model building:

Other:
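
Passing arguments on through ... relies on ordinary R argument forwarding: any named argument that the wrapper does not define itself is handed unchanged to the inner function. The sketch below illustrates that mechanism with two hypothetical stand-in functions, inner_train and outer_train; they are not ruimtehol code, and the default values shown are assumptions. The argument names validationFile and maxTrainTime are borrowed from the call in the Examples section.

```r
# Hypothetical stand-in for an internal trainer that accepts extra options.
# The defaults here are placeholders, not the real ruimtehol defaults.
inner_train <- function(file, dim = 100, validationFile = "", maxTrainTime = 0) {
  list(file = file, dim = dim,
       validationFile = validationFile, maxTrainTime = maxTrainTime)
}

# Hypothetical stand-in for the wrapper: it names a few arguments itself
# and forwards everything else untouched through `...`.
outer_train <- function(file, dim = 100, ...) {
  inner_train(file = file, dim = dim, ...)
}

settings <- outer_train("traindata.txt", dim = 10,
                        validationFile = "validationdata.txt",
                        maxTrainTime = 10)
str(settings)
```

An argument name that the inner function does not know results in an "unused argument" error, so a misspelled option is reported instead of being silently ignored in this sketch.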

References

https://github.com/facebookresearch

Examples

## Not run: 
data(dekamer, package = "ruimtehol")
x <- strsplit(dekamer$question, "\\W")
x <- lapply(x, FUN = function(x) x[x != ""])
x <- sapply(x, FUN = function(x) paste(x, collapse = " "))

## set the seed before sampling so the train/validation split is reproducible
set.seed(123456789)
idx <- sample.int(n = nrow(dekamer), size = round(nrow(dekamer) * 0.7))
writeLines(x[idx], con = "traindata.txt")
writeLines(x[-idx], con = "validationdata.txt")

m <- starspace(file = "traindata.txt", validationFile = "validationdata.txt", 
               trainMode = 5, dim = 10, 
               loss = "softmax", lr = 0.01, ngrams = 2, minCount = 5,
               similarity = "cosine", adagrad = TRUE, ws = 7, epoch = 3,
               maxTrainTime = 10)
str(starspace_dictionary(m))              
wordvectors <- as.matrix(m)
wv <- starspace_embedding(m, 
                          x = c("Nationale Loterij", "migranten", "pensioen"),
                          type = "ngram")
wv
mostsimilar <- embedding_similarity(wordvectors, wv["pensioen", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
starspace_knn(m, "koning")

## clean up for cran
file.remove(c("traindata.txt", "validationdata.txt"))

## End(Not run)

[Package ruimtehol version 0.3.2 Index]