label_topics {topiclabels}    R Documentation

Automatically label topics using language models based on top terms

Description

Performs an automated labeling process of topics from topic models using language models. For this, the top terms and (optionally) a short context description are used.

Usage

label_topics(
  terms,
  model = "mistralai/Mixtral-8x7B-Instruct-v0.1",
  params = list(),
  token = NA_character_,
  context = "",
  sep_terms = "; ",
  max_length_label = 5L,
  prompt_type = c("json", "plain", "json-roles"),
  max_wait = 0L,
  progress = TRUE
)

Arguments

terms

[list (k) of character]
List (each list entry represents one topic) of character vectors containing the top terms representing the topics that are to be labeled. If a single character vector is passed, this is interpreted as the top terms of a single topic. If a character matrix is passed, each column is interpreted as the top terms of a topic.

model

[character(1)]
Optional.
The language model to use for labeling the topics. The model must be accessible via the Huggingface API. Default is mistralai/Mixtral-8x7B-Instruct-v0.1. Other promising models are HuggingFaceH4/zephyr-7b-beta or tiiuae/falcon-7b-instruct.

params

[named list]
Optional.
Model parameters to pass. Default parameters for common models are given in the details section.

token

[character(1)]
Optional.
If you want to address the Huggingface API with a Huggingface token, enter it here. The main advantage of this is a higher rate limit.

context

[character(1)]
Optional.
Explanatory context for the topics to be labeled. Using a (very) brief explanation of the thematic context may greatly improve the usefulness of automatically generated topic labels.

sep_terms

[character(1)]
How should the top terms of a single topic be separated in the generated prompts? Default is separation via semicolon and space.

max_length_label

[integer(1)]
What is the maximum number of words a label should consist of? Default is five words.

prompt_type

[character(1)]
Which prompt type should be applied? We implemented various prompt types that differ mainly in how the response of the language model is requested. Examples are given in the details section. Default is to request JSON output.

max_wait

[integer(1)]
If the rate limit on Huggingface is reached: how long (in minutes) should the system wait before it asks the user whether to continue (in other words, to keep waiting)? The default is zero minutes, i.e. the user is asked every time the rate limit is reached.

progress

[logical(1)]
Should a nice progress bar be shown? Turning it off may lead to significantly faster computation. Default is TRUE.

Details

The function builds helpful prompts based on the top terms and sends these prompts to language models on Huggingface. The output is then post-processed so that the label for each topic is extracted automatically. If the automatically extracted labels show any errors, they can instead be extracted using custom functions or manually from the raw model output stored in the model_output entry of the lm_topic_labels object.
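If automatic extraction fails, the raw responses in model_output can be parsed manually. A minimal sketch in base R, assuming each raw response is a JSON string containing a 'label' field (as requested by the default prompt_type = "json"; the raw_output values below are hypothetical illustrations, not actual model responses):

```r
# Hypothetical raw model responses, as they might appear in the
# model_output entry of an lm_topic_labels object (structure assumed)
raw_output <- c('{"label": "Football Legends"}',
                '{"label": "Energy Sources"}')

# Extract the value of the "label" field with a base-R regular expression
extract_label <- function(x) {
  m <- regmatches(x, regexpr('"label"\\s*:\\s*"[^"]*"', x))
  gsub('.*:\\s*"([^"]*)"', "\\1", m)
}

labels <- vapply(raw_output, extract_label, character(1), USE.NAMES = FALSE)
labels  # c("Football Legends", "Energy Sources")
```

For responses that are valid JSON throughout, a parser such as jsonlite::fromJSON would be a more robust alternative to the regular expression above.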

Implemented default parameters for the models HuggingFaceH4/zephyr-7b-beta, tiiuae/falcon-7b-instruct, and mistralai/Mixtral-8x7B-Instruct-v0.1 are:

max_new_tokens

300

return_full_text

FALSE
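Entries passed via params override these defaults. A hedged sketch of a call with custom parameters (the parameter names follow the Huggingface text-generation API; the call requires network access and a token, so it is not run here):

```r
## Not run: requires access to the Huggingface API
label_topics(
  list(c("zidane", "figo", "kroos")),
  params = list(max_new_tokens = 100L,     # shorter responses than the default 300
                return_full_text = FALSE), # return only the generated continuation
  token = token
)
```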

Implemented prompt types are:

json

the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic

plain

the language model is asked to return an answer that should only consist of the best label for the topic

json-roles

the language model is asked to respond in JSON format with a single field called 'label', specifying the best label for the topic; in addition, the model is queried using identifiers for <|user|> input and the beginning of the <|assistant|> output
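The prompt type is selected via the prompt_type argument, for example (not run, requires API access; whether "plain" or "json" works better may depend on the chosen model):

```r
## Not run: requires access to the Huggingface API
label_topics(
  list(c("gas", "power", "wind")),
  prompt_type = "plain",  # ask for the bare label instead of JSON
  token = token
)
```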

Value

[named list] lm_topic_labels object.

Examples

## Not run: 
token = "" # please insert your hf token here
topwords_matrix = matrix(c("zidane", "figo", "kroos",
                           "gas", "power", "wind"), ncol = 2)
label_topics(topwords_matrix, token = token)
label_topics(list(c("zidane", "figo", "kroos"),
                  c("gas", "power", "wind")),
             token = token)
label_topics(list(c("zidane", "figo", "ronaldo"),
                  c("gas", "power", "wind")),
             token = token)

label_topics(list("wind", "greta", "hambach"),
             token = token)
label_topics(list("wind", "fire", "air"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             token = token)
label_topics(list("wind", "feuer", "luft"),
             context = "Elements of the Earth",
             token = token)

## End(Not run)

[Package topiclabels version 0.1.0 Index]