| bow_pp_create_basic_text_rep {aifeducation} | R Documentation |
Prepare texts for text embeddings with a bag-of-words approach.
Description
This function prepares raw texts for use with TextEmbeddingModel.
Usage
bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)
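The defaults window = 5 and weights = 1/(1:5) are plain R expressions, so they can be inspected directly. The short sketch below only evaluates them; reading them as "one weight per token distance from 1 to window" is an assumption based on the convention used by quanteda's fcm(), not something this page states.

```r
# Evaluate the default values shown in the usage block.
window <- 5
weights <- 1 / (1:5)

# Assumption: weights[k] applies to tokens that are k positions apart,
# so closer neighbours count more than distant ones.
print(round(weights, 3))
# [1] 1.000 0.500 0.333 0.250 0.200

# Under that reading, a custom weights vector needs one entry per distance.
stopifnot(length(weights) == window)
```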
Arguments
data |
|
vocab_draft |
Object created with bow_pp_create_vocab_draft. |
remove_punct |
|
remove_symbols |
|
remove_numbers |
|
remove_url |
|
remove_separators |
|
split_hyphens |
|
split_tags |
|
language_stopwords |
|
use_lemmata |
|
to_lower |
|
min_termfreq |
|
min_docfreq |
|
max_docfreq |
|
window |
|
weights |
|
trace |
|
Value
Returns a list of class basic_text_rep with the following components.
dfm: Document-Feature-Matrix. Rows correspond to the documents, columns to the tokens. Each cell holds the number of times a token occurs in a document.
fcm: Feature-Co-Occurrence-Matrix.
information: list containing information about the used vocabulary. These are:
  n_sentence: Number of sentences
  n_document_segments: Number of document segments/raw texts
  n_token_init: Number of initial tokens
  n_token_final: Number of final tokens
  n_lemmata: Number of lemmas
configuration: list indicating whether the vocabulary was created in lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied language model. These are:
  model: the udpipe language model
  label: the label of the udpipe language model
  upos: the applied universal part-of-speech tags
  language: the language
  vocab: a data.frame with the original vocabulary
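To make the dfm and fcm components more concrete, here is a minimal base-R sketch of what a document-feature matrix and a distance-weighted feature co-occurrence matrix contain. It is an illustration only, not the package's implementation: the toy documents, the window of 2, and the symmetric weighting scheme are all assumptions chosen for the example.

```r
# Hypothetical toy corpus: two documents that are already tokenized.
docs <- list(
  doc1 = c("teacher", "asks", "student", "question"),
  doc2 = c("student", "answers", "question")
)
vocab <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token,
# each cell counts how often the token occurs in the document.
dfm <- matrix(0, nrow = length(docs), ncol = length(vocab),
              dimnames = list(names(docs), vocab))
for (d in names(docs)) {
  for (tok in docs[[d]]) {
    dfm[d, tok] <- dfm[d, tok] + 1
  }
}

# Feature co-occurrence matrix with a sliding window.
# Assumption: weights[k] is added for token pairs that are k positions apart.
window <- 2
weights <- 1 / (1:window)
fcm <- matrix(0, nrow = length(vocab), ncol = length(vocab),
              dimnames = list(vocab, vocab))
for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (dist in seq_len(window)) {
      j <- i + dist
      if (j > n) break
      a <- tokens[i]
      b <- tokens[j]
      # Symmetric counts: both [a, b] and [b, a] are updated.
      fcm[a, b] <- fcm[a, b] + weights[dist]
      fcm[b, a] <- fcm[b, a] + weights[dist]
    }
  }
}

print(dfm)
print(fcm)
```

In this toy data, "student" and "question" are adjacent in doc1 (weight 1) and two positions apart in doc2 (weight 0.5), so their co-occurrence cell sums to 1.5.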
See Also
Other Preparation:
bow_pp_create_vocab_draft()