tokenize_character_shingles {tokenizers} | R Documentation |
Character shingle tokenizers
Description
The character shingle tokenizer works like an n-gram tokenizer, except that the units being shingled are characters rather than words. Options to the function let you choose whether non-alphanumeric characters such as punctuation are retained or discarded.
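To make the idea of a character shingle concrete, here is a minimal base-R sketch of a sliding window over characters. It is an illustration only, not the package's implementation: the helper name `char_shingles` is hypothetical, and returning an empty vector for inputs shorter than `n` is an assumption of this sketch.

```r
# Illustrative sketch only; `char_shingles` is a hypothetical helper,
# not part of the tokenizers package.
char_shingles <- function(s, n = 3L, lowercase = TRUE, strip_non_alphanum = TRUE) {
  if (lowercase) s <- tolower(s)
  # Drop every character that is not a letter or digit
  # (this removes punctuation and white space).
  if (strip_non_alphanum) s <- gsub("[^[:alnum:]]", "", s)
  len <- nchar(s)
  # Assumption of this sketch: too-short input yields no shingles.
  if (len < n) return(character(0))
  starts <- seq_len(len - n + 1L)
  # Each shingle is a window of n consecutive characters.
  substring(s, starts, starts + n - 1L)
}

char_shingles("Now is")
# "now" "owi" "wis"
```

With `strip_non_alphanum = FALSE` the space is kept, so the windows in this sketch slide over `"now is"` instead of `"nowis"`.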
Usage
tokenize_character_shingles(
x,
n = 3L,
n_min = n,
lowercase = TRUE,
strip_non_alphanum = TRUE,
simplify = FALSE
)
Arguments
x |
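A character vector or a list of character vectors to be tokenized
into character shingles. If x is a character vector, it can be of any length, and each element will be tokenized separately. If x is a list of character vectors, each element of the list should have a length of 1. |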
n |
The number of characters in each shingle. This must be an integer greater than or equal to 1. |
n_min |
The minimum number of characters in each shingle. This must be an integer greater than or equal to 1, and less
than or equal to n. |
lowercase |
Should the characters be made lower case? |
strip_non_alphanum |
Should punctuation and white space be stripped? |
simplify |
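FALSE by default. If TRUE and only a single element was passed as input, the output is a character vector of tokens instead of a list. |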
Value
A list of character vectors containing the tokens, with one element
in the list for each element that was passed as input. If simplify = TRUE
and only a single element was passed as input, then the output is a
character vector of tokens.
Examples
x <- c("Now is the hour of our discontent")
tokenize_character_shingles(x)
tokenize_character_shingles(x, n = 5)
tokenize_character_shingles(x, n = 5, strip_non_alphanum = FALSE)
tokenize_character_shingles(x, n = 5, n_min = 3, strip_non_alphanum = FALSE)
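The two return shapes described under Value can be checked with a short additional example. This is a sketch that assumes the tokenizers package is installed; the variable names `docs`, `per_doc`, and `single` are illustrative.

```r
library(tokenizers)

docs <- c("Now is the hour", "of our discontent")

# Default (simplify = FALSE): a list with one character vector
# of shingles per input element.
per_doc <- tokenize_character_shingles(docs)
length(per_doc)

# simplify = TRUE with a single input element returns a plain
# character vector rather than a list of length one.
single <- tokenize_character_shingles(docs[1], simplify = TRUE)
is.character(single)
```

Keeping the default `simplify = FALSE` gives the same return type regardless of input length, which is usually easier to handle in downstream code.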