bow_pp_create_basic_text_rep {aifeducation} | R Documentation |
Prepare texts for text embeddings with a bag-of-words approach.
Description
This function prepares raw texts for use with TextEmbeddingModel.
Usage
bow_pp_create_basic_text_rep(
data,
vocab_draft,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_numbers = TRUE,
remove_url = TRUE,
remove_separators = TRUE,
split_hyphens = FALSE,
split_tags = FALSE,
language_stopwords = "de",
use_lemmata = FALSE,
to_lower = FALSE,
min_termfreq = NULL,
min_docfreq = NULL,
max_docfreq = NULL,
window = 5,
weights = 1/(1:5),
trace = TRUE
)
Arguments
data |
|
vocab_draft |
Object created with bow_pp_create_vocab_draft. |
remove_punct |
|
remove_symbols |
|
remove_numbers |
|
remove_url |
|
remove_separators |
|
split_hyphens |
|
split_tags |
|
language_stopwords |
|
use_lemmata |
|
to_lower |
|
min_termfreq |
|
min_docfreq |
|
max_docfreq |
|
window |
|
weights |
|
trace |
|
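The defaults window = 5 and weights = 1/(1:5) pair one weight with each token distance inside the window. The base R sketch below assumes the weights are applied by distance, as in weighted co-occurrence counting; the tokens and the helper weighted_cooc are hypothetical and only illustrate the idea, not the package's internal implementation.

```r
# Default setting: one weight per distance 1..window (here window = 5).
window <- 5
weights <- 1 / (1:window)

# Hypothetical helper: weighted co-occurrence counts for one token sequence.
weighted_cooc <- function(tokens, window, weights) {
  stopifnot(length(weights) == window)
  features <- unique(tokens)
  m <- matrix(0, length(features), length(features),
              dimnames = list(features, features))
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j > n) break
      a <- tokens[i]
      b <- tokens[j]
      # Symmetric count: add the distance weight in both directions.
      m[a, b] <- m[a, b] + weights[d]
      if (a != b) m[b, a] <- m[b, a] + weights[d]
    }
  }
  m
}

# Made-up token sequence for illustration.
tokens <- c("teacher", "explains", "topic", "students", "ask", "teacher")
fcm_toy <- weighted_cooc(tokens, window, weights)
round(fcm_toy["teacher", ], 2)
```

With these weights, directly adjacent tokens contribute 1 to a cell, while tokens five positions apart contribute only 0.2, so close neighbours dominate the co-occurrence counts.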
Value
Returns a list of class basic_text_rep with the following components.
dfm: Document-Feature-Matrix. Rows correspond to the documents. Columns correspond to the features (tokens or lemmas); each cell holds how often the feature occurs in the document.
fcm: Feature-Co-Occurrence-Matrix.
information: list containing information about the used vocabulary. These are:
    n_sentence: Number of sentences
    n_document_segments: Number of document segments/raw texts
    n_token_init: Number of initial tokens
    n_token_final: Number of final tokens
    n_lemmata: Number of lemmas
configuration: list containing information on whether the vocabulary was created with lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied language model. These are:
    model: the udpipe language model
    label: the label of the udpipe language model
    upos: the applied universal part-of-speech tags
    language: the language
    vocab: a data.frame with the original vocabulary
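The component names listed above suggest how a returned object can be inspected. The following sketch builds a mock list with the same documented names so that it runs without the package; all values are made up, plain matrices stand in for the real matrix objects, and the contents of configuration are left empty because its element names are not documented here.

```r
# Mock object mirroring the documented structure; every value is invented.
rep_mock <- structure(
  list(
    dfm = matrix(c(2, 0, 1, 1), nrow = 2,
                 dimnames = list(c("doc1", "doc2"), c("school", "teacher"))),
    fcm = matrix(c(0, 1.5, 1.5, 0), nrow = 2,
                 dimnames = list(c("school", "teacher"),
                                 c("school", "teacher"))),
    information = list(
      n_sentence = 3,
      n_document_segments = 2,
      n_token_init = 10,
      n_token_final = 4,
      n_lemmata = 2
    ),
    configuration = list(),  # element names are not given in this help page
    language_model = list(
      model = NULL,          # placeholder for the udpipe model
      label = "mock-model",
      upos = c("NOUN", "ADJ", "VERB"),
      language = "english",
      vocab = data.frame(token = c("school", "teacher"))
    )
  ),
  class = "basic_text_rep"
)

# Check the class and look at the size of the document-feature matrix.
inherits(rep_mock, "basic_text_rep")
dim(rep_mock$dfm)  # documents x features

# Share of tokens that survived preprocessing and filtering.
share_kept <- rep_mock$information$n_token_final /
  rep_mock$information$n_token_init
share_kept
```

Comparing n_token_final with n_token_init in this way gives a quick impression of how strongly the removal and frequency settings reduced the vocabulary.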
See Also
Other Preparation:
bow_pp_create_vocab_draft()
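Examples
A minimal sketch of how the two preparation functions might be combined. Several things here are assumptions rather than documented facts: that data accepts a character vector of raw texts, the argument names passed to bow_pp_create_vocab_draft (check its own help page), the stopword abbreviation "en", and the placeholder path to a udpipe model file. The code runs the calls only if an aifeducation version that exports these functions is installed and the model file exists.

```r
# Hypothetical inputs: two short raw texts and a path to a udpipe model file.
texts <- c(
  "The teacher explains the topic to the students.",
  "The students ask questions about the topic."
)
model_path <- "path/to/english-model.udpipe"  # placeholder, not a real file

# Run only if the bow_pp_* functions are exported by the installed package
# and the model file is present.
needed <- c("bow_pp_create_vocab_draft", "bow_pp_create_basic_text_rep")
can_run <- requireNamespace("aifeducation", quietly = TRUE) &&
  all(needed %in% getNamespaceExports("aifeducation")) &&
  file.exists(model_path)

if (can_run) {
  # Assumed arguments of the companion function; verify against its help page.
  vocab_draft <- aifeducation::bow_pp_create_vocab_draft(
    path_language_model = model_path,
    data = texts,
    upos = c("NOUN", "ADJ", "VERB"),
    label_language_model = "english-model",
    language = "english",
    trace = FALSE
  )

  # Arguments not listed here keep the defaults shown under Usage.
  basic_text_rep <- aifeducation::bow_pp_create_basic_text_rep(
    data = texts,
    vocab_draft = vocab_draft,
    language_stopwords = "en",  # assumed abbreviation for English
    window = 5,
    weights = 1 / (1:5),
    trace = FALSE
  )
  print(class(basic_text_rep))
} else {
  message("aifeducation with bow_pp_* functions or the model file ",
          "is not available; skipping the example.")
}
```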