chunk_text {tokenizers}    R Documentation
Chunk text into smaller segments
Description
Given a text or vector/list of texts, break the texts into smaller segments each with the same number of words. This allows you to treat a very long document, such as a novel, as a set of smaller documents.
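For instance (a minimal, self-contained sketch; the toy document below is made up for illustration):

library(tokenizers)

# A made-up 25-word "document"
doc <- paste(rep("word", 25), collapse = " ")

# Break it into 10-word segments; the last chunk is expected
# to hold the remaining 5 words
chunks <- chunk_text(doc, chunk_size = 10)
length(chunks)  # 3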
Usage
chunk_text(x, chunk_size = 100, doc_id = names(x), ...)
Arguments
x
A character vector or a list of character vectors to be chunked. If x is a character vector, each element will be chunked separately. If x is a list of character vectors, each element of the list should have a length of 1.
chunk_size
The number of words in each chunk.
doc_id
The document IDs as a character vector. This will be taken from the names of the x vector if available.
...
Arguments passed on to tokenize_words.
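A short sketch of how doc_id interacts with a named input (the text values here are hypothetical, and the exact chunk-name suffix may differ across package versions):

library(tokenizers)

texts <- c(novel = paste(rep("word", 12), collapse = " "),
           essay = paste(rep("term", 8), collapse = " "))

# doc_id defaults to names(texts), so chunk names are derived
# from "novel" and "essay" (the exact suffix format is an assumption)
chunks <- chunk_text(texts, chunk_size = 5)
names(chunks)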
Details
Chunking the text passes it through tokenize_words, which will strip punctuation and lowercase the text unless you provide arguments to pass along to that function.
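For example, to keep the original casing and punctuation in the chunks, pass the tokenize_words arguments lowercase and strip_punct through ... (a brief sketch; the sample sentence is made up):

library(tokenizers)

text <- "Call me Ishmael. Some years ago, never mind how long precisely."
# lowercase and strip_punct are forwarded to tokenize_words
chunk_text(text, chunk_size = 5, lowercase = FALSE, strip_punct = FALSE)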
Examples
## Not run:
chunked <- chunk_text(mobydick, chunk_size = 100)
length(chunked)
chunked[1:3]
## End(Not run)