corpora_to_word_list {keyToEnglish}    R Documentation

Corpora to Word List

Description

Converts a collection of plaintext documents into a word list, filtering words by length, frequency, and stopwords.

Usage

corpora_to_word_list(
  paths,
  ascii_only = TRUE,
  custom_regex = NA,
  max_word_length = 20,
  stopword_fn = DEFAULT_STOPWORDS,
  min_word_count = 5,
  max_size = 16^3,
  min_word_length = 3,
  output_file = NA,
  json_path = NA
)

Arguments

paths

Paths of plaintext documents

ascii_only

If TRUE, non-ASCII characters are omitted

custom_regex

If not NA, a regular expression that overrides ascii_only and defines what a valid word consists of

max_word_length

Maximum length of extracted words

stopword_fn

Name of a file containing the stopwords to use, or the stopwords themselves (if length > 1)

min_word_count

Minimum number of occurrences for a word to be added to word list

max_size

Maximum number of words in the list

min_word_length

Minimum length of words

output_file

File to write list to

json_path

If the input text is JSON, a character vector of JSON keys to follow; when supplied, the input is parsed as JSON
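
To illustrate how the filtering arguments above might combine, here is a minimal base R sketch. It is not the package's implementation: the helper sketch_word_list, its word_regex default, and the tiny stopword vector are assumptions made for illustration only.

```r
# Hypothetical helper, NOT part of keyToEnglish: mimics the documented filters in base R.
sketch_word_list <- function(text,
                             word_regex = "[a-z]+",        # assumed stand-in for ascii_only = TRUE
                             stopwords = c("the", "and"),  # assumed toy stopword vector
                             min_word_length = 3,
                             max_word_length = 20,
                             min_word_count = 2,
                             max_size = 16^3) {
  # Lowercase the text and extract every match of the word pattern
  lowered <- tolower(paste(text, collapse = " "))
  words <- regmatches(lowered, gregexpr(word_regex, lowered))[[1]]

  # Keep words within the length bounds
  n <- nchar(words)
  words <- words[n >= min_word_length & n <= max_word_length]

  # Drop stopwords
  words <- words[!(words %in% stopwords)]

  # Count occurrences, keep frequent words, and cap the list size
  counts <- sort(table(words), decreasing = TRUE)
  counts <- counts[counts >= min_word_count]
  head(names(counts), max_size)
}

text <- c("The quick fox and the lazy dog.",
          "A quick dog, a lazy fox, and a cat.")
result <- sketch_word_list(text)
print(result)
```

In this toy input, "cat" occurs only once and falls below min_word_count, "a" is shorter than min_word_length, and "the" and "and" are removed as stopwords.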

Value

A 'character' vector of words
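
Examples

A hedged usage sketch based only on the signature shown under Usage. The temporary file and its contents are made up for illustration, and the call runs only if keyToEnglish is installed. The exact words returned depend on the package's default stopwords and tokenization, so no output is shown.

```r
# Write a small throwaway corpus to a temporary file (illustrative content)
doc_path <- tempfile(fileext = ".txt")
writeLines(rep("correct horse battery staple", 10), doc_path)

# Only call the package function when keyToEnglish is available
if (requireNamespace("keyToEnglish", quietly = TRUE)) {
  words <- keyToEnglish::corpora_to_word_list(
    paths = doc_path,
    min_word_count = 5,    # each word above occurs 10 times
    min_word_length = 3,
    max_word_length = 20
  )
  print(words)
}
```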


[Package keyToEnglish version 0.2.1 Index]