tokenizers {textreuse} | R Documentation
Split texts into tokens
Description
These functions each turn a text into tokens. The tokenize_ngrams
function returns shingled n-grams.
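As an illustration of shingling (a minimal base-R sketch, not the package's implementation), every window of n consecutive words is pasted into a single token:

shingle <- function(words, n = 3) {
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}
shingle(c("how", "many", "roads", "must", "a", "man"))
# "how many roads" "many roads must" "roads must a" "must a man"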
Usage
tokenize_words(string, lowercase = TRUE)
tokenize_sentences(string, lowercase = TRUE)
tokenize_ngrams(string, lowercase = TRUE, n = 3)
tokenize_skip_ngrams(string, lowercase = TRUE, n = 3, k = 1)
Arguments
string: A character vector of length 1 to be tokenized.
lowercase: Should the tokens be made lower case?
n: For n-gram tokenizers, the number of words in each n-gram.
k: For the skip n-gram tokenizer, the maximum skip distance between
words. The function will compute all skip n-grams between 0 and k.
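To illustrate the role of k, here is a minimal base-R sketch of the n = 2 case (skip bigrams): each token pairs a word with a following word, skipping over at most k intervening words. The helper skip_bigrams is hypothetical and not part of textreuse.

skip_bigrams <- function(words, k = 1) {
  out <- character(0)
  for (i in seq_along(words)) {
    for (s in 0:k) {        # a skip of s means s words are passed over
      j <- i + 1 + s
      if (j <= length(words)) out <- c(out, paste(words[i], words[j]))
    }
  }
  out
}
skip_bigrams(c("how", "many", "roads", "must"), k = 1)
# "how many" "how roads" "many roads" "many must" "roads must"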
Details
These functions will strip all punctuation.
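For example (output shown as this behavior implies, not verified against the package):

tokenize_words("Stop! Go.")
# "stop" "go"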
Value
A character vector containing the tokens.
Examples
dylan <- "How many roads must a man walk down? The answer is blowin' in the wind."
tokenize_words(dylan)       # individual words, lowercased, punctuation stripped
tokenize_sentences(dylan)   # one token per sentence
tokenize_ngrams(dylan, n = 2)  # two-word shingles (bigrams)
tokenize_skip_ngrams(dylan, n = 3, k = 2)  # trigrams, skipping up to 2 words
[Package textreuse version 0.1.5 Index]