train_tune_bert_model {aifeducation}    R Documentation
Function for training and fine-tuning a BERT model
Description
This function can be used to train or fine-tune a transformer based on BERT architecture with the help of the python libraries 'transformers', 'datasets', and 'tokenizers'.
Usage
train_tune_bert_model(
ml_framework = aifeducation_config$get_framework(),
output_dir,
model_dir_path,
raw_texts,
p_mask = 0.15,
whole_word = TRUE,
val_size = 0.1,
n_epoch = 1,
batch_size = 12,
chunk_size = 250,
full_sequences_only = FALSE,
min_seq_len = 50,
learning_rate = 0.003,
n_workers = 1,
multi_process = FALSE,
sustain_track = TRUE,
sustain_iso_code = NULL,
sustain_region = NULL,
sustain_interval = 15,
trace = TRUE,
keras_trace = 1,
pytorch_trace = 1,
pytorch_safetensors = TRUE
)
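A minimal sketch of a fine-tuning run is shown below. It assumes that a pre-trained BERT model has already been stored in the directory "my_bert_model" and that example_texts is a character vector with one document per element; the directory names and the texts are hypothetical and only illustrate the call.

library(aifeducation)

# Hypothetical character vector with one raw document per element
example_texts <- c(
  "First training document ...",
  "Second training document ..."
)

# Fine-tune the model and save the result to disk (no object is returned)
train_tune_bert_model(
  ml_framework = "pytorch",
  output_dir = "my_bert_model_tuned",
  model_dir_path = "my_bert_model",
  raw_texts = example_texts,
  p_mask = 0.15,
  whole_word = TRUE,
  val_size = 0.1,
  n_epoch = 2,
  batch_size = 12,
  chunk_size = 250,
  sustain_track = FALSE,
  trace = TRUE
)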
Arguments
ml_framework
Framework to use for training: ml_framework = "tensorflow" for 'tensorflow' or ml_framework = "pytorch" for 'pytorch'.
output_dir
Path to the directory where the trained or fine-tuned model should be saved.
model_dir_path
Path to the directory of the model that should be trained or fine-tuned.
raw_texts
Vector containing the raw texts used for training.
p_mask
double Ratio determining the share of tokens to be masked.
whole_word
TRUE if whole word masking should be applied. FALSE for masking of single tokens.
val_size
double Ratio determining the share of the data used for validation.
n_epoch
int Number of training epochs.
batch_size
int Size of the batches.
chunk_size
int Maximum length (in tokens) of the chunks into which the raw texts are split.
full_sequences_only
TRUE if only chunks with a sequence length equal to chunk_size should be used for training.
min_seq_len
int Minimal sequence length of a chunk for inclusion in training. Only relevant if full_sequences_only = FALSE.
learning_rate
double Learning rate for training.
n_workers
int Number of workers.
multi_process
TRUE if multiple processes should be activated.
sustain_track
TRUE if energy consumption should be tracked during training with the python library 'codecarbon'.
sustain_iso_code
ISO code (Alpha-3 code) of the country where the computation takes place. Must be set if sustainability tracking is active. A list of the codes can be found on Wikipedia: https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes
sustain_region
Region within a country. Only available for the USA and Canada. See the documentation of codecarbon for more information: https://mlco2.github.io/codecarbon/parameters.html
sustain_interval
integer Interval in seconds for measuring power usage.
trace
TRUE if information about the training progress should be printed to the console.
keras_trace
int keras_trace = 0 prints no information about the training process, keras_trace = 1 prints a progress bar, keras_trace = 2 prints one line per epoch.
pytorch_trace
int pytorch_trace = 0 prints no information about the training process, pytorch_trace = 1 prints a progress bar.
pytorch_safetensors
TRUE if a pytorch model should be saved in safetensors format. FALSE if it should be saved in the standard pytorch format (.bin). Only relevant for pytorch models.
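The sketch below illustrates how raw_texts can be assembled from a folder of plain-text files and how sustainability tracking is switched on; the folder "corpus_folder", the model directories, and the ISO code "DEU" are assumptions chosen for illustration only.

# Read every .txt file in a (hypothetical) folder into one element of a character vector
txt_files <- list.files("corpus_folder", pattern = "\\.txt$", full.names = TRUE)
raw_texts <- vapply(
  txt_files,
  function(f) paste(readLines(f, warn = FALSE), collapse = " "),
  character(1)
)

# Track energy consumption with codecarbon during training
train_tune_bert_model(
  ml_framework = "tensorflow",
  output_dir = "bert_tuned",
  model_dir_path = "bert_base",
  raw_texts = raw_texts,
  sustain_track = TRUE,
  sustain_iso_code = "DEU",   # Alpha-3 country code required by codecarbon
  sustain_interval = 15
)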
Value
This function does not return an object. Instead, the trained or fine-tuned model is saved to disk.
Note
This model uses a WordPiece tokenizer like BERT and can be trained with whole word masking. The 'transformers' library may show a warning, which can be ignored.
Pre-trained models that can be fine-tuned with this function are available at https://huggingface.co/.
New models can be created via the function create_bert_model.
Training of the model makes use of dynamic masking, in contrast to the original paper, in which static masking was applied.
References
Devlin, J., Chang, M.‑W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186). Association for Computational Linguistics. doi:10.18653/v1/N19-1423
Hugging Face documentation https://huggingface.co/docs/transformers/model_doc/bert#transformers.TFBertForMaskedLM
See Also
Other Transformer:
create_bert_model(), create_deberta_v2_model(), create_funnel_model(), create_longformer_model(), create_roberta_model(), train_tune_deberta_v2_model(), train_tune_funnel_model(), train_tune_longformer_model(), train_tune_roberta_model()