BERT_vocab {FMAT}    R Documentation

Check if mask words are in the model vocabulary.

Description

Check whether each mask word is in the vocabulary of each model, optionally adding out-of-vocabulary words (or phrases) as new tokens.

Usage

BERT_vocab(
  models,
  mask.words,
  add.tokens = FALSE,
  add.method = c("sum", "mean")
)

Arguments

models

Model names at HuggingFace (one or more).

mask.words

Candidate words to fill in the mask.

add.tokens

Whether to add new tokens (for out-of-vocabulary words or even phrases) to the model vocabulary. Defaults to FALSE. Tokens are added only temporarily for the task; the raw model file is not changed.

add.method

Method used to produce the token embeddings of newly added tokens. Can be "sum" (default) or "mean" of the subword token embeddings.
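
As a rough illustration of the difference between "sum" and "mean", the base R sketch below combines the embeddings of two subword pieces into one vector. The matrix values and the names subword_emb, emb_sum, and emb_mean are invented for illustration; this is not FMAT's internal code, only the arithmetic the two options refer to.

```r
# Toy subword embeddings (values invented for illustration):
# each row is one subword piece of a hypothetical out-of-vocabulary word,
# each column is one embedding dimension.
subword_emb <- rbind(
  piece1 = c(0.2, -0.4, 1.0),
  piece2 = c(0.6,  0.0, 0.2)
)

# "sum": add the subword vectors dimension by dimension
emb_sum <- colSums(subword_emb)

# "mean": average the subword vectors dimension by dimension
emb_mean <- colMeans(subword_emb)

print(emb_sum)
print(emb_mean)
```

The two results point in the same direction and differ only in scale: the sum equals the mean multiplied by the number of subword pieces.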

Value

A data.table with the model name, mask word, real token (replaced if out of vocabulary), and token id (0 to N).
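
To make the vocabulary check and the zero-based token ids concrete, here is a toy lookup in base R. The vocabulary, the words, and the column names are all invented for illustration and do not reproduce BERT_vocab output; in particular, out-of-vocabulary words simply get NA here.

```r
# Hypothetical lowercase-only vocabulary (invented), listed in id order.
vocab <- c("[PAD]", "[UNK]", "bruce", "2020", "individual")
words <- c("bruce", "Bruce", "2020", "2025")

# match() returns 1-based positions (NA if absent);
# subtract 1 to get 0-based token ids.
token_id <- match(words, vocab) - 1L

result <- data.frame(
  mask.word = words,
  in.vocab  = !is.na(token_id),
  token.id  = token_id
)
print(result)
```

The lookup is case-sensitive, so "Bruce" is missing from this lowercase-only toy vocabulary even though "bruce" is present.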

See Also

BERT_download

BERT_info

FMAT_run

Examples

## Not run: 
models = c("bert-base-uncased", "bert-base-cased")
BERT_info(models)

BERT_vocab(models, c("bruce", "Bruce"))

BERT_vocab(models, 2020:2025)  # some are out-of-vocabulary
BERT_vocab(models, 2020:2025, add.tokens=TRUE)  # add vocab

BERT_vocab(models,
           c("individualism", "artificial intelligence"),
           add.tokens=TRUE)

## End(Not run)


[Package FMAT version 2024.7 Index]