tokenize {textreuse} | R Documentation |
Recompute the tokens for a document or corpus
Description
Given a TextReuseTextDocument or a TextReuseCorpus, this function recomputes the tokens and hashes with the functions specified. Optionally, it can also recompute the minhash signatures.
Usage
tokenize(
  x,
  tokenizer,
  ...,
  hash_func = hash_string,
  minhash_func = NULL,
  keep_tokens = FALSE,
  keep_text = TRUE
)
Arguments
x |
A TextReuseTextDocument or TextReuseCorpus. |
tokenizer |
A function to split the text into tokens. See tokenizers. |
... |
Arguments passed on to the tokenizer. |
hash_func |
A function to hash the tokens. See hash_string. |
minhash_func |
A function to create minhash signatures. See minhash_generator. |
keep_tokens |
Should the tokens be saved in the document that is returned or discarded? |
keep_text |
Should the text be saved in the document that is returned or discarded? |
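The tokenizer argument accepts any function, so a custom one can be supplied. The sketch below is a minimal illustration in base R, assuming (as the built-in tokenizers suggest) that a tokenizer takes a single string and returns a character vector of tokens. The name tokenize_clauses and its lowercase argument are hypothetical and not part of textreuse.

```r
# Hypothetical custom tokenizer (not part of textreuse): splits a string into
# clauses at commas, semicolons, and periods.
tokenize_clauses <- function(string, lowercase = TRUE) {
  stopifnot(is.character(string), length(string) == 1L)
  if (lowercase) string <- tolower(string)
  # Split on the delimiters, then trim surrounding whitespace.
  pieces <- trimws(strsplit(string, "[,;.]")[[1]])
  # Drop empty pieces left by trailing or repeated delimiters.
  pieces[nzchar(pieces)]
}

clauses <- tokenize_clauses("First clause; second clause, Third clause.")
print(clauses)
```

With the package loaded, a function like this could be passed as tokenize(corpus, tokenize_clauses, lowercase = FALSE), in which case lowercase would be forwarded to the tokenizer through the ... argument.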
Value
The modified TextReuseTextDocument or TextReuseCorpus.
Examples
dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)
corpus <- tokenize(corpus, tokenize_ngrams, keep_tokens = TRUE)
head(tokens(corpus[[1]]))
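The optional minhash recomputation mentioned in the Description can be sketched in the same way. This example assumes the minhash function comes from minhash_generator() and that the signatures are read back with minhashes(); the values n = 200 and seed = 1 are arbitrary choices for illustration.

```r
library(textreuse)

dir <- system.file("extdata/legal", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = NULL)

# Build a minhash function; n and seed are illustrative values only.
minhash <- minhash_generator(n = 200, seed = 1)

# Recompute tokens, hashes, and minhash signatures in one pass.
# n = 3 is forwarded to tokenize_ngrams through the ... argument.
corpus <- tokenize(corpus, tokenize_ngrams, n = 3,
                   minhash_func = minhash, keep_tokens = TRUE)

sig <- minhashes(corpus[[1]])
length(sig)
```

Fixing the seed keeps the signatures reproducible across sessions, which matters if documents tokenized at different times are to be compared.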
[Package textreuse version 0.1.5 Index]