create_vocabulary {text2vec} | R Documentation |
Creates a vocabulary of unique terms
Description
This function collects unique terms and corresponding statistics. See below for details.
Usage
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L),
stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L,
ngram_max = 1L), stopwords = character(0), sep_ngram = "_",
window_size = 0L, ...)
Arguments
it |
iterator over a |
ngram |
|
stopwords |
|
sep_ngram |
|
window_size |
|
... |
placeholder for additional arguments (not used at the moment). |
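The arguments in the usage signature above can be combined as in the following minimal sketch. The two toy documents and the choice of "the" as a stopword are assumptions made for illustration only; they are not part of the package.

```r
library(text2vec)

# Two toy documents (hypothetical data, for illustration only)
docs = c("the quick brown fox", "the lazy dog")

# Iterator over tokenized, lowercased documents
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)

# Collect unigrams and bigrams, drop the stopword "the",
# and join the words of each bigram with "_"
vocab = create_vocabulary(it,
                          ngram = c(ngram_min = 1L, ngram_max = 2L),
                          stopwords = c("the"),
                          sep_ngram = "_")

print(vocab$term)
```

Stopwords are matched as given, so if the preprocessor changes the text (here, lowercasing), the stopwords should probably be passed in the same form.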
Value
A text2vec_vocabulary object, which is actually a data.frame with the following columns:
term |
|
term_count |
|
doc_count |
|
It also contains meta-information in attributes:
ngram: integer vector, the lower and upper boundary of the range of n-gram values.
document_count: integer, the number of documents the vocabulary was built from.
stopwords: character vector of stopwords.
sep_ngram: character, the separator for n-grams.
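As a rough illustration of the returned structure, the sketch below builds a vocabulary from two toy documents (hypothetical data) and reads the columns and attributes listed above with base R accessors. It assumes a text2vec version in which the vocabulary is a data.frame, as this page describes.

```r
library(text2vec)

# Two toy documents (hypothetical data, for illustration only)
docs = c("the cat sat", "the dog sat down")
it = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
            progressbar = FALSE)
vocab = create_vocabulary(it)

# Columns of the data.frame: term, term_count, doc_count
print(vocab[, c("term", "term_count", "doc_count")])

# Meta-information stored in attributes
print(attr(vocab, "ngram"))           # lower and upper n-gram boundary
print(attr(vocab, "document_count"))  # number of documents processed
print(attr(vocab, "stopwords"))
print(attr(vocab, "sep_ngram"))
```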
Methods (by class)
- character: creates a text2vec_vocabulary from a predefined character vector. Terms will be inserted as is, without any checks (ngrams number, ngram delimiters, etc.).
- itoken: collects unique terms and corresponding statistics from the object.
- itoken_parallel: collects unique terms and corresponding statistics from the iterator.
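The character method can be sketched as follows. The term list is a made-up example; note that "not_good" is accepted as a single term even though it looks like a bigram, since no checks are applied.

```r
library(text2vec)

# A user-defined list of terms (hypothetical, for illustration only)
terms = c("good", "bad", "not_good")

# Terms are inserted as is; no counts are collected from any documents
vocab = create_vocabulary(terms)

print(vocab$term)
```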
Examples
library(text2vec)
data("movie_review")
txt = movie_review[['review']][1:100]
# lowercase each review, then split it into words
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
# drop rare and overly common terms, and cap the vocabulary size
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8,
                                doc_proportion_min = 0.001, vocab_term_max = 20000)