R: Corpora to Word List

corpora_to_word_list {keyToEnglish}

R Documentation

Corpora to Word List

Description

Converts a collection of plaintext documents into a word list, filtering the extracted words by length, frequency, and stopwords.

Usage

corpora_to_word_list(
  paths,
  ascii_only = TRUE,
  custom_regex = NA,
  max_word_length = 20,
  stopword_fn = DEFAULT_STOPWORDS,
  min_word_count = 5,
  max_size = 16^3,
  min_word_length = 3,
  output_file = NA,
  json_path = NA
)

Arguments

`paths`	Paths of the plaintext documents to read
`ascii_only`	If TRUE, non-ASCII characters are omitted
`custom_regex`	If not NA, overrides ascii_only and defines what a valid word consists of
`max_word_length`	Maximum length of extracted words
`stopword_fn`	Name of a file containing stopwords to use, or a list of stopwords (if length > 1)
`min_word_count`	Minimum number of occurrences for a word to be added to the word list
`max_size`	Maximum size of the word list (the default, 16^3, is 4096)
`min_word_length`	Minimum length of extracted words
`output_file`	File to write the word list to
`json_path`	If not NA, the input text is parsed as JSON; should be a character vector of JSON keys to follow

Value

A 'character' vector of words
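
Examples

The following is a minimal base-R sketch of the kind of filtering the arguments above describe. It is not the keyToEnglish implementation. The helper name sketch_word_list, its word_regex and stopwords parameters, the lowercasing of input, and the choice to keep the most frequent words when truncating to max_size are all assumptions made for illustration.

```r
# Hypothetical helper: approximates the documented filters in base R.
# It is NOT the package's code; names and ordering are assumptions.
sketch_word_list <- function(paths,
                             word_regex = "[a-z]+",   # assumed ASCII-only word pattern
                             stopwords = character(0),
                             min_word_count = 5,
                             max_size = 16^3,
                             min_word_length = 3,
                             max_word_length = 20) {
  # Read every document and lowercase it (lowercasing is an assumption)
  text <- tolower(unlist(lapply(paths, readLines, warn = FALSE)))

  # Extract every substring that matches the word pattern
  words <- unlist(regmatches(text, gregexpr(word_regex, text)))

  # Apply the length bounds and drop stopwords
  n <- nchar(words)
  words <- words[n >= min_word_length & n <= max_word_length]
  words <- words[!words %in% stopwords]

  # Count occurrences, keep words seen often enough, cap the list size
  counts <- sort(table(words), decreasing = TRUE)
  counts <- counts[counts >= min_word_count]
  head(names(counts), max_size)
}

# A small temporary corpus: the same sentence repeated five times
tmp <- tempfile(fileext = ".txt")
writeLines(rep("The quick brown fox jumps over the lazy dog", 5), tmp)

words <- sketch_word_list(tmp, stopwords = c("the", "over"))
print(words)

# With keyToEnglish installed, the analogous call would use the documented
# arguments, for example:
#   keyToEnglish::corpora_to_word_list(tmp, min_word_count = 5)
```

Each remaining word occurs exactly five times in the temporary corpus, so all of them meet the default min_word_count of 5; raising the threshold to 6 would leave the sketch with an empty result.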

Package keyToEnglish version 0.2.1