bow_pp_create_basic_text_rep {aifeducation}    R Documentation

Prepare texts for text embeddings with a bag-of-words approach.

Description

This function prepares raw texts for use with TextEmbeddingModel.

Usage

bow_pp_create_basic_text_rep(
  data,
  vocab_draft,
  remove_punct = TRUE,
  remove_symbols = TRUE,
  remove_numbers = TRUE,
  remove_url = TRUE,
  remove_separators = TRUE,
  split_hyphens = FALSE,
  split_tags = FALSE,
  language_stopwords = "de",
  use_lemmata = FALSE,
  to_lower = FALSE,
  min_termfreq = NULL,
  min_docfreq = NULL,
  max_docfreq = NULL,
  window = 5,
  weights = 1/(1:5),
  trace = TRUE
)

Arguments

data

vector containing the raw texts.

vocab_draft

Object created with bow_pp_create_vocab_draft.

remove_punct

bool TRUE if punctuation should be removed.

remove_symbols

bool TRUE if symbols should be removed.

remove_numbers

bool TRUE if numbers should be removed.

remove_url

bool TRUE if URLs should be removed.

remove_separators

bool TRUE if separators should be removed.

split_hyphens

bool TRUE if hyphenated words should be split into several tokens.

split_tags

bool TRUE if tags should be split.

language_stopwords

string Abbreviation for the language for which stopwords should be removed.

use_lemmata

bool TRUE if lemmas should be used instead of the original tokens.

to_lower

bool TRUE if tokens or lemmas should be converted to lower case.

min_termfreq

int Minimum frequency of a token to be part of the vocabulary.

min_docfreq

int Minimum number of documents in which a token must appear to be part of the vocabulary.

max_docfreq

int Maximum number of documents in which a token may appear to be part of the vocabulary.

window

int Size of the window for creating the feature co-occurrence matrix.

weights

vector weights for the corresponding window. The vector length must be equal to the window size.

trace

bool TRUE if information about the progress should be printed to console.
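
The requirement that weights has as many elements as window can be checked before calling the function. The following is a minimal base-R sketch, not part of the package; it only reproduces the default values from the Usage section and verifies the length constraint.

```r
# Default values taken from the Usage section.
window <- 5
weights <- 1 / (1:5)

# The documentation requires one weight per window position.
stopifnot(length(weights) == window)

# Inspect the default weights, rounded for readability.
round(weights, 2)
# 1.00 0.50 0.33 0.25 0.20

# When changing the window size, deriving the weights from it
# keeps both arguments consistent.
window <- 3
weights <- 1 / seq_len(window)
stopifnot(length(weights) == window)
```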

Value

Returns a list of class basic_text_rep with the following components.

See Also

Other Preparation: bow_pp_create_vocab_draft()
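
Examples

The following sketch shows one possible way to call the function. It rests on several assumptions: raw_texts and call_args are hypothetical names chosen for illustration, the two German sentences are made-up sample data, and vocab_draft is assumed to have been created beforehand with bow_pp_create_vocab_draft(), whose arguments are not documented on this page. The call is only executed if the package is installed and such an object exists.

```r
# Hypothetical sample data: a character vector of raw texts.
raw_texts <- c(
  "Die Schule beginnt um acht Uhr.",
  "Der Unterricht endet am Nachmittag."
)

# Arguments that differ from the defaults shown in the Usage section.
call_args <- list(
  data = raw_texts,
  language_stopwords = "de",
  to_lower = TRUE,
  trace = FALSE
)

# Run the call only when the package is available and a vocabulary draft
# (assumed to come from bow_pp_create_vocab_draft()) already exists.
if (requireNamespace("aifeducation", quietly = TRUE) && exists("vocab_draft")) {
  call_args$vocab_draft <- vocab_draft
  basic_text_rep <- do.call(
    aifeducation::bow_pp_create_basic_text_rep,
    call_args
  )
  # According to the Value section, this is a list of class basic_text_rep.
  class(basic_text_rep)
}
```

All arguments left out of call_args, such as window and weights, keep the defaults listed under Usage.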


[Package aifeducation version 0.3.3 Index]