textTopics {text}

R Documentation

This function creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)

Description

This function creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)

Usage

textTopics(
  data,
  variable_name,
  embedding_model = "distilroberta",
  umap_model = "default",
  hdbscan_model = "default",
  vectorizer_model = "default",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  stopwords = "english",
  min_df = 5,
  bm25_weighting = FALSE,
  reduce_frequent_words = TRUE,
  set_seed = 8,
  save_dir = "./results"
)

Arguments

`data`	(tibble/data.frame) A tibble with a text-variable to be analysed, and optional numeric/categorical variables that you might want to use for later analyses testing the significance of topics in relation to these variables.
`variable_name`	(string) Name of the text-variable in the data tibble that you want to perform topic modeling on.
`embedding_model`	(string) Name of the embedding model to use such as "miniLM", "mpnet", "multi-mpnet", "distilroberta".
`umap_model`	(string) The dimension reduction algorithm, currently only "default" is supported.
`hdbscan_model`	(string) The clustering algorithm to use, currently only "default" is supported.
`vectorizer_model`	(string) Name of the vectorizer model, currently only "default" is supported.
`representation_model`	(string) Name of the representation model used for topics, including "keybert" or "mmr".
`num_top_words`	(integer) Determine the number of top words presented for each topic.
`n_gram_range`	(vector) Numeric vector of length two giving the minimum and maximum n-gram size used by the vectorizer model, e.g. c(1, 3) for unigrams through trigrams.
`stopwords`	(string) Name of the stopword dictionary to use.
`min_df`	(integer) The minimum document frequency of terms.
`bm25_weighting`	(boolean) Determine whether bm25_weighting is used for ClassTfidfTransformer.
`reduce_frequent_words`	(boolean) Determine whether frequent words are reduced by ClassTfidfTransformer.
`set_seed`	(integer) The random seed for initialization of the umap model.
`save_dir`	(string) The directory for saving results.
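
The arguments above can be put together as in the following minimal sketch. The data frame `survey`, its column `response_text`, the `topic_args` list and the `run_model` switch are hypothetical names introduced for illustration only; the argument values simply repeat the defaults shown under Usage. The call itself is switched off because it needs the text package, a working Python environment with bertopic, and a corpus far larger than three rows (with min_df = 5, terms must occur in at least five documents).

```r
# Hypothetical data: one document per row, text in `response_text`.
# Real use needs many more documents than this.
survey <- data.frame(
  response_text = c(
    "I feel calm and at peace with my family.",
    "Work has been stressful and I sleep badly.",
    "Spending time outdoors makes me happy."
  ),
  age = c(34, 51, 27),
  stringsAsFactors = FALSE
)

# Arguments repeat the documented defaults, written out explicitly.
topic_args <- list(
  data = survey,
  variable_name = "response_text",
  embedding_model = "distilroberta",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  stopwords = "english",
  min_df = 5,
  set_seed = 8,
  save_dir = "./results"
)

# Fail early if the named column is missing or is not character data.
stopifnot(
  topic_args$variable_name %in% names(topic_args$data),
  is.character(topic_args$data[[topic_args$variable_name]])
)

# Switched off by default: set run_model to TRUE only when the text package
# and its Python backend are installed and the corpus is large enough.
run_model <- FALSE
if (run_model && requireNamespace("text", quietly = TRUE)) {
  model <- do.call(text::textTopics, topic_args)
}
```

Keeping the arguments in a list makes it easy to rerun the model with a single value changed, for example a different `embedding_model` or `set_seed`.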

Value

A folder containing the model, the data, a folder with terms and values for each topic, and the document-topic matrix. In addition, the model itself is returned, formatted as a data.frame together with metadata. See textTopicsReduce, textTopicsTest and textTopicsWordcloud.
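
The saved output can be inspected with base R after a run. This is only a sketch: it assumes the default save_dir of "./results" was used, and because the file names inside the folder are not documented here, it lists whatever is present rather than expecting particular files. The variable names `results_dir` and `saved_files` are illustrative.

```r
# Assumed location: the default save_dir from the Usage section.
results_dir <- "./results"

# List everything that was written, including files in subfolders.
# If the folder does not exist yet, the result is an empty character vector.
saved_files <- if (dir.exists(results_dir)) {
  list.files(results_dir, recursive = TRUE)
} else {
  character(0)
}

print(saved_files)
```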

[Package text version 1.2.3 Index]