textTopics {text} | R Documentation |
This function creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)
Description
This function creates and trains a BERTopic model (based on the bertopic Python package) on a text variable in a tibble/data.frame. (EXPERIMENTAL)
Usage
textTopics(
  data,
  variable_name,
  embedding_model = "distilroberta",
  umap_model = "default",
  hdbscan_model = "default",
  vectorizer_model = "default",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  stopwords = "english",
  min_df = 5,
  bm25_weighting = FALSE,
  reduce_frequent_words = TRUE,
  set_seed = 8,
  save_dir = "./results"
)
Arguments
data |
(tibble/data.frame) A tibble or data frame containing the text variable to analyse and, optionally, numeric or categorical variables for later analyses that test how topics relate to those variables. |
variable_name |
(string) Name of the text variable in data to perform topic modeling on. |
embedding_model |
(string) Name of the embedding model to use, such as "miniLM", "mpnet", "multi-mpnet", or "distilroberta". |
umap_model |
(string) The dimension reduction algorithm; currently only "default" is supported. |
hdbscan_model |
(string) The clustering algorithm to use; currently only "default" is supported. |
vectorizer_model |
(string) Name of the vectorizer model; currently only "default" is supported. |
representation_model |
(string) Name of the representation model used for topics, such as "keybert" or "mmr". |
num_top_words |
(integer) The number of top words presented for each topic. |
n_gram_range |
(vector) Vector of length two giving the n-gram range used by the vectorizer model. |
stopwords |
(string) Name of the stopword dictionary to use. |
min_df |
(integer) The minimum document frequency of terms. |
bm25_weighting |
(boolean) Whether BM25 weighting is used by ClassTfidfTransformer. |
reduce_frequent_words |
(boolean) Whether frequent words are reduced by ClassTfidfTransformer. |
set_seed |
(integer) The random seed used to initialize the UMAP model. |
save_dir |
(string) The directory for saving results. |
Value
A folder containing the model, the data, a folder with terms and values for each topic,
and the document-topic matrix. In addition, the model itself is returned, formatted as a data.frame
together with metadata.
See textTopicsReduce, textTopicsTest, and textTopicsWordcloud.
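Examples
The call below is a minimal sketch of how the arguments listed under Usage fit together; it is not taken from the package's own examples. The data frame survey, its columns (id, response, age), the helper list topic_args, and the run_model flag are hypothetical names introduced only for illustration. The six placeholder sentences are far too few for a meaningful model (min_df = 5 alone requires terms to occur in at least five documents), so the actual call is guarded and only runs once real data and the Python backend are in place.

```r
# Hypothetical placeholder data: one text column plus an optional numeric
# variable that could be related to topics in later analyses.
survey <- data.frame(
  id = 1:6,
  response = c(
    "I feel calm and at peace with my family",
    "Work has been stressful and I sleep badly",
    "Spending time outdoors makes me feel balanced",
    "I worry about money and my job",
    "My friends give me energy and support",
    "I am tired but hopeful about the future"
  ),
  age = c(34, 51, 27, 45, 39, 62),
  stringsAsFactors = FALSE
)

# Arguments mirror the Usage section; the values shown are the documented
# defaults, written out explicitly for readability.
topic_args <- list(
  data = survey,
  variable_name = "response",
  embedding_model = "distilroberta",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  stopwords = "english",
  min_df = 5,
  set_seed = 8,
  save_dir = "./results"
)

# Guard (an assumption of this sketch): flip run_model to TRUE only when the
# text package and its Python dependencies are installed and the data set is
# large enough to train on.
run_model <- FALSE
if (run_model && requireNamespace("text", quietly = TRUE)) {
  model <- do.call(text::textTopics, topic_args)
  str(model, max.level = 1)  # inspect the returned object and its metadata
}
```

Collecting the arguments in a list first makes it easy to check them (for example, that variable_name is really a column of data) before starting a potentially slow training run.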