itoken {text2vec}	R Documentation
Iterators (and parallel iterators) over input objects
Description
This family of functions creates iterators over input objects in order to build vocabularies, document-term matrices (DTM) and term-co-occurrence matrices (TCM). The resulting iterators are usually consumed by the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See those functions for details.
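The typical pipeline chains these pieces together. A minimal sketch (assuming the movie_review dataset shipped with text2vec):

library(text2vec)
data("movie_review")
# iterator over raw text: lowercase each chunk, then split into word tokens
it = itoken(movie_review$review, preprocessor = tolower,
            tokenizer = word_tokenizer, progressbar = FALSE)
v = create_vocabulary(it)               # first pass: collect term statistics
dtm = create_dtm(it, vocab_vectorizer(v))  # second pass: build the DTM

In recent text2vec versions the iterator can be reused across passes as above; with older versions it may need to be re-created before create_dtm.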
Usage
itoken(iterable, ...)
## S3 method for class 'character'
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10,
progressbar = interactive(), ids = NULL, ...)
## S3 method for class 'list'
itoken(iterable, n_chunks = 10,
progressbar = interactive(), ids = names(iterable), ...)
## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity,
tokenizer = space_tokenizer, progressbar = interactive(), ...)
itoken_parallel(iterable, ...)
## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)
## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity,
tokenizer = space_tokenizer, n_chunks = 1L, ...)
## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)
Arguments
iterable: an object from which to generate an iterator.
...: arguments passed to other methods.
preprocessor: function which takes a chunk of character vectors and performs all pre-processing; it should return a character vector of cleaned documents (default: identity).
tokenizer: function which takes a character vector from preprocessor, splits it into tokens and returns a list of character vectors (default: space_tokenizer). If stemming is needed, call the stemmer inside the tokenizer (see Examples).
n_chunks: integer, the number of chunks the input is split into; each chunk is processed independently (and, for itoken_parallel, in parallel). Larger chunks are usually faster but require more RAM.
progressbar: logical, whether to display a progress bar (default: interactive()).
ids: vector of document ids. If NULL, names(iterable) is used when available; otherwise incremental ids are assigned.
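For illustration, a minimal sketch of the preprocessor/tokenizer contract (the clean function below is a hypothetical helper, not part of text2vec):

# preprocessor: character vector in, cleaned character vector out
clean = function(x) gsub("[^[:alnum:] ]", " ", tolower(x))
it = itoken(c("One document.", "Another document!"),
            preprocessor = clean,
            tokenizer = space_tokenizer,  # character vector in, list of token vectors out
            n_chunks = 1, progressbar = FALSE)
v = create_vocabulary(it)  # consume the iterator, e.g. to build a vocabulary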
Details
S3 methods for creating an itoken iterator from a list of tokens, raw text or file sources:

list: all elements of the input list should be character vectors containing tokens
character: raw text source; the user must provide a tokenizer function
ifiles: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)
idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize it (to itoken)
ifiles_parallel: from files in parallel

A sketch of the file-based route is shown below.
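A sketch of the file-based variant (the directory path is hypothetical; reader = readLines mirrors the ifiles default):

files = list.files("path/to/texts", full.names = TRUE)  # hypothetical directory
it_files = itoken(ifiles(files, reader = readLines),
                  preprocessor = tolower, tokenizer = word_tokenizer)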
See Also
ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
Examples
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of stemming tokenizer (requires the SnowballC package)
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
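# The stemmer can then be plugged in as the tokenizer argument
# (kept commented out, like above, since it assumes SnowballC is installed):
# it = itoken(txt, tolower, stem_tokenizer, ids = ids)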
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2^16), type = 'TsparseMatrix'))
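# NOTE (assumption): itoken_parallel processes chunks in parallel only when a
# parallel backend is registered, e.g. via the doParallel package; without one
# it falls back to sequential processing.
# if (require(doParallel)) {
#   registerDoParallel(2)
#   it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
#   dtm = create_dtm(it, hash_vectorizer(2^16))
# }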