split_text {deeplr}    R Documentation
Split texts into segments
Description
split_text splits texts into segments that do not exceed a specified maximum size in bytes.
Usage
split_text(text, max_size_bytes = 29000, tokenize = "sentences")
Arguments
text
    character vector to be split.

max_size_bytes
    maximum size of a single text segment in bytes.

tokenize
    level of tokenization. Either "sentences" or "words".
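A minimal sketch of a call with non-default arguments (not part of the package's own examples; the input string is made up for illustration):

# Split into smaller segments, tokenizing at the word level
txt <- paste(rep("Short sentence.", 500), collapse = " ")
split_text(txt, max_size_bytes = 1000, tokenize = "words")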
Details
The function uses tokenizers::tokenize_sentences to split texts.
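As a rough illustration of this tokenization step (a sketch that calls the tokenizers package directly, not code from deeplr), tokenize_sentences returns one character vector of sentences per input element, which split_text then groups into byte-limited segments:

# Sentence tokenization underlying split_text (illustrative input)
tokenizers::tokenize_sentences("First sentence. Second sentence. Third sentence.")
# returns a list with one character vector containing the three sentences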
Value
Returns a tibble with the following columns:

- text_id: position of the text in the character vector.

- segment_id: ID of a text segment.

- segment_text: text segment that is smaller than max_size_bytes.
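A minimal sketch (assuming the columns listed above; the input is illustrative) of checking that every returned segment stays within the byte limit:

# Verify the byte limit using base R's nchar(type = "bytes")
res <- split_text("One sentence. Another sentence.", max_size_bytes = 25)
all(nchar(res$segment_text, type = "bytes") <= 25)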
Examples
## Not run:
# Split long text
text <- paste0(rep("This is a very long text.", 10000), collapse = " ")
split_text(text)
## End(Not run)