Tokenizer {NLP} | R Documentation |
Tokenizer objects
Description
Create tokenizer objects.
Usage
Span_Tokenizer(f, meta = list())
as.Span_Tokenizer(x, ...)
Token_Tokenizer(f, meta = list())
as.Token_Tokenizer(x, ...)
Arguments
f |
a tokenizer function taking the string to tokenize as
argument, and returning either the tokens (for
|
meta |
a named or empty list of tokenizer metadata tag-value pairs. |
x |
an R object. |
... |
further arguments passed to or from other methods. |
Details
Tokenization is the process of breaking a text string up into words, phrases, symbols, or other meaningful elements called tokens. This can be accomplished by returning the sequence of tokens, or the corresponding spans (character start and end positions). We refer to tokenization resources of the respective kinds as “token tokenizers” and “span tokenizers”.
Span_Tokenizer()
and Token_Tokenizer()
return tokenizer
objects which are functions with metadata and suitable class
information, which in turn can be used for converting between the two
kinds using as.Span_Tokenizer()
or as.Token_Tokenizer()
.
It is also possible to coerce annotator (pipeline) objects to
tokenizer objects, provided that the annotators provide suitable
token annotations. By default, word tokens are used; this can be
controlled via the type
argument of the coercion methods (e.g.,
use type = "sentence"
to extract sentence tokens).
There are also print()
and format()
methods for
tokenizer objects, which use the description
element of the
metadata if available.
See Also
Regexp_Tokenizer()
for creating regexp span tokenizers.
Examples
## A simple text.
s <- String(" First sentence. Second sentence. ")
## ****5****0****5****0****5****0****5**
## Use a pre-built regexp (span) tokenizer:
wordpunct_tokenizer
wordpunct_tokenizer(s)
## Turn into a token tokenizer:
tt <- as.Token_Tokenizer(wordpunct_tokenizer)
tt
tt(s)
## Of course, in this case we could simply have done
s[wordpunct_tokenizer(s)]
## to obtain the tokens from the spans.
## Conversion also works the other way round: package 'tm' provides
## the following token tokenizer function:
scan_tokenizer <- function(x)
scan(text = as.character(x), what = "character", quote = "",
quiet = TRUE)
## Create a token tokenizer from this:
tt <- Token_Tokenizer(scan_tokenizer)
tt(s)
## Turn into a span tokenizer:
st <- as.Span_Tokenizer(tt)
st(s)
## Checking tokens from spans:
s[st(s)]