R: Prepare texts for text embeddings with a bag of word...

bow_pp_create_basic_text_rep {aifeducation}

R Documentation

Prepare texts for text embeddings with a bag-of-words approach.

Description

This function prepares raw texts for use with TextEmbeddingModel.

Usage

bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)

Arguments

`data`	`vector` containing the raw texts.
`vocab_draft`	Object created with bow_pp_create_vocab_draft.
`remove_punct`	`bool` `TRUE` if punctuation should be removed.
`remove_symbols`	`bool` `TRUE` if symbols should be removed.
`remove_numbers`	`bool` `TRUE` if numbers should be removed.
`remove_url`	`bool` `TRUE` if URLs should be removed.
`remove_separators`	`bool` `TRUE` if separators should be removed.
`split_hyphens`	`bool` `TRUE` if hyphenated words should be split into several tokens.
`split_tags`	`bool` `TRUE` if tags should be split.
`language_stopwords`	`string` Abbreviation for the language for which stopwords should be removed.
`use_lemmata`	`bool` `TRUE` if lemmas should be used instead of the original tokens.
`to_lower`	`bool` `TRUE` if tokens or lemmas should be converted to lower case.
`min_termfreq`	`int` Minimum frequency of a token to be part of the vocabulary.
`min_docfreq`	`int` Minimum number of documents in which a token must appear to be part of the vocabulary.
`max_docfreq`	`int` Maximum number of documents in which a token may appear to be part of the vocabulary.
`window`	`int` Size of the window for creating the feature-co-occurrence matrix.
`weights`	`vector` weights for the corresponding window. The vector length must be equal to the window size.
`trace`	`bool` `TRUE` if information about the progress should be printed to console.
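
The relationship between `window` and `weights` can be illustrated in base R. The following is a minimal sketch, not code from the package: the default `weights = 1/(1:5)` yields one weight per position in a window of size 5, decreasing with distance. The names `window_small` and `weights_small` are hypothetical and only show how a matching pair could be built for a different window size.

```r
# Default pairing from the Usage section: window of 5, inverse-distance weights
window <- 5
weights <- 1 / (1:window)  # 1, 1/2, 1/3, 1/4, 1/5

# The weights vector must have exactly one entry per window position
stopifnot(length(weights) == window)

# Hypothetical alternative: a smaller window with matching weights
window_small <- 3
weights_small <- 1 / seq_len(window_small)
stopifnot(length(weights_small) == window_small)
```

When changing `window`, derive `weights` from it as above so that the two arguments cannot get out of sync.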

Value

Returns a list of class basic_text_rep with the following components.

dfm: Document-Feature-Matrix. Rows correspond to the documents. Columns correspond to the tokens; each cell contains the number of times the token occurs in the document.
fcm: Feature-Co-Occurrence-Matrix.
information: list containing information about the used vocabulary. These are:
- n_sentence: Number of sentences
- n_document_segments: Number of document segments/raw texts
- n_token_init: Number of initial tokens
- n_token_final: Number of final tokens
- n_lemmata: Number of lemmas
configuration: list containing information on whether the vocabulary was created in lower case and whether it uses original tokens or lemmas.
language_model: list containing information about the applied language model. These are:
- model: the udpipe language model
- label: the label of the udpipe language model
- upos: the applied universal part-of-speech tags
- language: the language
- vocab: a data.frame with the original vocabulary
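
To make the `dfm` and `fcm` components more concrete, here is a small base-R sketch of what such matrices contain. It is an illustration under simplifying assumptions and not the package's implementation: the toy documents, the symmetric counting scheme, and all variable names are hypothetical, and the real function may count co-occurrences differently.

```r
# Two toy documents, already tokenized (hypothetical data)
docs <- list(
  doc1 = c("school", "teacher", "school"),
  doc2 = c("teacher", "student")
)
features <- sort(unique(unlist(docs)))

# Document-feature matrix: one row per document, one column per token,
# each cell holds how often the token occurs in the document
dfm <- t(vapply(docs, function(tokens) {
  tabulate(match(tokens, features), nbins = length(features))
}, integer(length(features))))
colnames(dfm) <- features

# Feature-co-occurrence matrix with a weighted window:
# a pair of tokens at distance d contributes weights[d]
window <- 2
weights <- 1 / (1:window)
fcm <- matrix(0, length(features), length(features),
              dimnames = list(features, features))
for (tokens in docs) {
  n <- length(tokens)
  for (i in seq_len(n)) {
    for (d in seq_len(window)) {
      j <- i + d
      if (j <= n) {
        a <- tokens[i]
        b <- tokens[j]
        fcm[a, b] <- fcm[a, b] + weights[d]
        # Count symmetrically, but do not double-count identical tokens
        if (a != b) fcm[b, a] <- fcm[b, a] + weights[d]
      }
    }
  }
}

print(dfm)
print(fcm)
```

In this sketch, tokens that are closer together contribute more to the co-occurrence count, which is the effect of a decreasing `weights` vector.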
