tokenizer {tm}    R Documentation
Tokenizers
Description
Tokenize a document or character vector.
Usage
Boost_tokenizer(x)
MC_tokenizer(x)
scan_tokenizer(x)
Arguments
x
A character vector, or an object that can be coerced to character by as.character.
Details
The quality and correctness of a tokenization algorithm depends strongly on the context and application scenario. Relevant factors are the language of the underlying text and the notions of whitespace (which can vary with the encoding used and the language) and punctuation marks. Consequently, for superior results you probably need a custom tokenization function. The following tokenizers are provided (a small comparison sketch follows the list).
- Boost_tokenizer
Uses the Boost (https://www.boost.org) Tokenizer (via Rcpp).
- MC_tokenizer
Implements the functionality of the tokenizer in the MC toolkit (https://www.cs.utexas.edu/users/dml/software/mc/).
- scan_tokenizer
Simulates scan(..., what = "character").
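As a quick, hedged illustration of how these defaults differ, one can run all three tokenizers on the same input; the sample sentence below is invented for illustration, and the exact token boundaries depend on each tokenizer's rules for punctuation.
## Minimal comparison sketch (assumes package tm is attached; the
## sample sentence is made up for illustration). scan_tokenizer
## splits on whitespace only, like scan(); the other two also apply
## their own rules for punctuation.
s <- "Crude oil prices rose 1.2%, traders said."
Boost_tokenizer(s)
MC_tokenizer(s)
scan_tokenizer(s)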
Value
A character vector consisting of the tokens obtained by tokenization of x.
See Also
- getTokenizers to list the tokenizers provided by package tm.
- Regexp_Tokenizer for tokenizers using regular expressions provided by package NLP.
- tokenize for a simple regular-expression-based tokenizer provided by package tau.
- tokenizers for a collection of tokenizers provided by package tokenizers.
Examples
data("crude")
Boost_tokenizer(crude[[1]])
MC_tokenizer(crude[[1]])
scan_tokenizer(crude[[1]])
strsplit_space_tokenizer <- function(x)
unlist(strsplit(as.character(x), "[[:space:]]+"))
strsplit_space_tokenizer(crude[[1]])
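Beyond standalone use, a tokenizer is typically plugged into term-document matrix construction through the control list of TermDocumentMatrix (see the tokenize entry documented in ?termFreq). A minimal sketch, assuming the crude corpus loaded above:
## Build a term-document matrix using MC_tokenizer instead of the
## default tokenizer; the tokenize control entry accepts a function
## mapping a character vector to tokens (see ?termFreq). The
## subscript below is purely illustrative.
tdm <- TermDocumentMatrix(crude, control = list(tokenize = MC_tokenizer))
inspect(tdm[1:5, 1:3])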