corpora_to_word_list {keyToEnglish} | R Documentation |
Corpora to Word List
Description
Converts a collection of documents into a word list.
Usage
corpora_to_word_list(
  paths,
  ascii_only = TRUE,
  custom_regex = NA,
  max_word_length = 20,
  stopword_fn = DEFAULT_STOPWORDS,
  min_word_count = 5,
  max_size = 16^3,
  min_word_length = 3,
  output_file = NA,
  json_path = NA
)
Arguments
paths | Paths of plaintext documents |
ascii_only | If TRUE, omits non-ASCII characters |
custom_regex | If not NA, overrides ascii_only and determines what a valid word consists of |
max_word_length | Maximum length of extracted words |
stopword_fn | Name of a file containing the stopwords to use, or the stopwords themselves (if length > 1) |
min_word_count | Minimum number of occurrences required for a word to be added to the word list |
max_size | Maximum size of the word list |
min_word_length | Minimum length of extracted words |
output_file | File to write the word list to |
json_path | If the input text is JSON, a character vector of JSON keys to follow; when supplied, the input is parsed as JSON |
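To illustrate how the filtering arguments might interact, here is a minimal base-R sketch. It is not the keyToEnglish implementation: the function name sketch_word_list, the lower-casing step, the token pattern, the stand-in stopwords, and the choice to keep the most frequent words when truncating to max_size are all assumptions made for illustration.

```r
# Illustrative sketch only; NOT the keyToEnglish implementation.
# sketch_word_list and its behavior are hypothetical.
sketch_word_list <- function(text,
                             word_regex = "[a-z]+",        # assumed token pattern
                             stopwords = c("the", "and"),  # stand-in stopwords
                             min_word_length = 3,
                             max_word_length = 20,
                             min_word_count = 5,
                             max_size = 16^3) {
  # Extract lower-cased tokens that match the pattern (lower-casing is assumed)
  lowered <- tolower(text)
  tokens <- unlist(regmatches(lowered, gregexpr(word_regex, lowered)))

  # Keep tokens whose length lies within the allowed range
  n_chars <- nchar(tokens)
  tokens <- tokens[n_chars >= min_word_length & n_chars <= max_word_length]

  # Drop stopwords
  tokens <- tokens[!tokens %in% stopwords]

  # Count occurrences and keep words that occur often enough
  counts <- sort(table(tokens), decreasing = TRUE)
  counts <- counts[counts >= min_word_count]

  # Truncate to at most max_size words (most frequent first, by assumption)
  as.character(head(names(counts), max_size))
}

text <- c("apple banana apple", "banana apple the the the cherry")
sketch_word_list(text, min_word_count = 2)
# "apple" "banana"
```

In this sketch "cherry" is dropped because it occurs only once, and "the" is dropped as a stopword even though it is frequent enough.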
Value
A 'character' vector of words
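Examples
The following is a hedged usage sketch based only on the signature shown under Usage. The sample sentences and the temporary file are made up for illustration, and the exact words returned depend on the package's default stopwords, so no output is shown.

```r
# Hypothetical example: the sample text and temporary file are illustrative.
# Write a small plaintext document to a temporary file
doc_path <- tempfile(fileext = ".txt")
writeLines(
  c(
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox naps beside the lazy dog."
  ),
  doc_path
)

# Run only if the package is installed
if (requireNamespace("keyToEnglish", quietly = TRUE)) {
  # min_word_count is lowered from its default of 5 so that words in
  # this tiny corpus can qualify
  words <- keyToEnglish::corpora_to_word_list(
    paths = doc_path,
    min_word_count = 2,
    min_word_length = 3,
    max_word_length = 20
  )

  # Per the Value section, this is a character vector of words
  print(words)
}
```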
[Package keyToEnglish version 0.2.1 Index]