tokenizers {NLP}    R Documentation
Regexp tokenizers
Description
Tokenizers using regular expressions to match either tokens or separators between tokens.
Usage
Regexp_Tokenizer(pattern, invert = FALSE, ..., meta = list())
blankline_tokenizer(s)
whitespace_tokenizer(s)
wordpunct_tokenizer(s)
Arguments
pattern
a character string giving the regular expression to use for matching.
invert
a logical indicating whether to match separators between tokens.
...
further arguments to be passed to gregexpr().
meta
a named or empty list of tokenizer metadata tag-value pairs.
s
a String object, or something coercible to this using as.String() (e.g., a character string).
Details
Regexp_Tokenizer() creates regexp span tokenizers which use the
given pattern and ... arguments to match tokens or
separators between tokens via gregexpr(), and then
transform the match results into character spans of the tokens
found.
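For illustration, here is a minimal sketch of building and applying a custom regexp span tokenizer (the digit pattern, the sample text, and the invert variant are our own illustration, not taken from the package documentation):

library("NLP")
## Illustrative only: a span tokenizer matching runs of digits.
digit_tokenizer <- Regexp_Tokenizer("[[:digit:]]+")
digit_tokenizer(as.String("call 555 0123 now"))
## With invert = TRUE the same pattern marks separators instead,
## so the non-digit stretches become the tokens.
Regexp_Tokenizer("[[:digit:]]+", invert = TRUE)(as.String("call 555 0123 now"))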
whitespace_tokenizer() tokenizes by treating any sequence of
whitespace characters as a separator.
blankline_tokenizer() tokenizes by treating any sequence of
blank lines as a separator.
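For example (a minimal sketch; the two-paragraph sample text is our own):

## Paragraphs separated by a blank line become the token spans.
s <- as.String("First paragraph.\n\nSecond paragraph.")
spans <- blankline_tokenizer(s)
s[spans]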
wordpunct_tokenizer() tokenizes by matching sequences of
alphabetic characters and sequences of (non-whitespace) non-alphabetic
characters.
Value
Regexp_Tokenizer() returns the created regexp span tokenizer.
blankline_tokenizer(), whitespace_tokenizer() and
wordpunct_tokenizer() return the spans of the tokens found in
s.
See Also
Span_Tokenizer() for general information on span
tokenizer objects.
Examples
## A simple text.
s <- String(" First sentence. Second sentence. ")
## ****5****0****5****0****5****0****5**
spans <- whitespace_tokenizer(s)
spans
s[spans]
spans <- wordpunct_tokenizer(s)
spans
s[spans]
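## A span tokenizer can also be coerced to a token tokenizer, which
## returns the tokens themselves rather than their spans (a sketch
## assuming as.Token_Tokenizer() from this package; see
## Token_Tokenizer for details).
tt <- as.Token_Tokenizer(whitespace_tokenizer)
tt(s)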