tokenizers {NLP} | R Documentation
Regexp tokenizers
Description
Tokenizers using regular expressions to match either tokens or separators between tokens.
Usage
Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)
Arguments
pattern | a character string giving the regular expression to use for matching.
invert | a logical indicating whether to match separators between tokens (rather than the tokens themselves).
... | further arguments to be passed to gregexpr().
meta | a named or empty list of tokenizer metadata tag-value pairs.
s | a String object, or something coercible to this using as.String().
Details
Regexp_Tokenizer() creates regexp span tokenizers which use the given pattern and ... arguments to match tokens or separators between tokens via gregexpr(), and then transform the results of this into character spans of the tokens found.
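To make the step from gregexpr() matches to character spans concrete, here is a minimal base R sketch. The helper regexp_spans() is hypothetical and not part of NLP; it returns a plain data frame rather than a span object, and only approximates what the description above says.

```r
## Hypothetical helper (not part of NLP): turn gregexpr() matches into
## start/end character positions. With invert = TRUE the matches are
## treated as separators and the gaps between them become the tokens.
regexp_spans <- function(x, pattern, invert = FALSE, ...) {
    m <- gregexpr(pattern, x, ...)[[1L]]
    if (m[1L] == -1L) {
        ## No match: with invert = TRUE the whole (non-empty) string is
        ## a single token, otherwise there are no tokens.
        if (invert && nchar(x) > 0L)
            return(data.frame(start = 1L, end = nchar(x)))
        return(data.frame(start = integer(), end = integer()))
    }
    start <- as.integer(m)
    end <- start + attr(m, "match.length") - 1L
    if (invert) {
        ## Candidate tokens run from just after one separator to just
        ## before the next; drop empty candidates at the string edges.
        tok_start <- c(1L, end + 1L)
        tok_end <- c(start - 1L, nchar(x))
        keep <- tok_start <= tok_end
        start <- tok_start[keep]
        end <- tok_end[keep]
    }
    data.frame(start = start, end = end)
}

x <- "  First sentence.  Second sentence.  "
ws_spans <- regexp_spans(x, "\\s+", invert = TRUE)    # whitespace as separator
alpha_spans <- regexp_spans(x, "[[:alpha:]]+")        # match tokens directly
substring(x, ws_spans$start, ws_spans$end)
```

The extra ... arguments are forwarded to gregexpr() unchanged, so options such as perl = TRUE can be supplied in the same way as described for Regexp_Tokenizer().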
whitespace_tokenizer() tokenizes by treating any sequence of whitespace characters as a separator.
blankline_tokenizer() tokenizes by treating any sequence of blank lines as a separator.
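As a brief illustration of blank-line splitting, the following sketch applies blankline_tokenizer() to a made-up two-paragraph string. The sample text is an assumption chosen for illustration; the single line break inside the first paragraph is not a blank line and so should not act as a separator.

```r
library("NLP")

## Two paragraphs separated by one blank line; the first paragraph
## also contains an ordinary line break.
s <- String("First paragraph,\nstill first.\n\nSecond paragraph.")
spans <- blankline_tokenizer(s)
spans
## Extract the paragraph texts from the spans.
paragraphs <- s[spans]
paragraphs
```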
wordpunct_tokenizer() tokenizes by matching sequences of alphabetic characters and sequences of (non-whitespace) non-alphabetic characters.
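The two matching modes can also be used to build custom tokenizers. The sketch below is illustrative only: the patterns, the tokenizer names word_tokenizer and field_tokenizer, and the sample strings are assumptions, and the patterns are not claimed to be the ones the predefined tokenizers use internally.

```r
library("NLP")

## Match tokens directly: each run of alphabetic characters is a token.
word_tokenizer <- Regexp_Tokenizer("[[:alpha:]]+")

## Match separators instead: a comma plus optional trailing whitespace
## separates the fields (invert = TRUE).
field_tokenizer <- Regexp_Tokenizer(",\\s*", invert = TRUE)

s <- String("  First sentence.  Second sentence.  ")
words <- s[word_tokenizer(s)]
words

csv <- String("alpha, beta,gamma")
fields <- csv[field_tokenizer(csv)]
fields
```

Matching tokens directly tends to be convenient when the tokens have a simple shape, while invert = TRUE is the more natural choice when the delimiter is easier to describe than the tokens.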
Value
Regexp_Tokenizer() returns the created regexp span tokenizer.
blankline_tokenizer(), whitespace_tokenizer() and wordpunct_tokenizer() return the spans of the tokens found in s.
See Also
Span_Tokenizer() for general information on span tokenizer objects.
Examples
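library("NLP")

## A simple text.
s <- String("  First sentence.  Second sentence.  ")
##           ****5****0****5****0****5****0****5**
## The ruler above marks every fifth character position in s.

## Split on runs of whitespace, then extract the tokens.
spans <- whitespace_tokenizer(s)
spans
s[spans]

## Separate runs of alphabetic characters from runs of punctuation.
spans <- wordpunct_tokenizer(s)
spans
s[spans]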