| tokenizers {text2vec} | R Documentation | 
Simple tokenization functions for string splitting
Description
A few simple tokenization functions. For a more comprehensive set, see the tokenizers package:
https://cran.r-project.org/package=tokenizers.
Also check the stringi::stri_split_*() functions.
Usage
word_tokenizer(strings, ...)
char_tokenizer(strings, ...)
space_tokenizer(strings, sep = " ", xptr = FALSE, ...)
postag_lemma_tokenizer(strings, udpipe_model, tagger = "default",
  tokenizer = "tokenizer", pos_keep = character(0),
  pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ",
  "AUX", "X", "INTJ"))
Arguments
| strings | character vector of documents to tokenize. | 
| ... | other parameters (usually not used - see source code for details). | 
| sep | character of length 1 - split strings by this single character. | 
| xptr | logical - if TRUE, tokenize at the C++ level, which can be noticeably faster. | 
| udpipe_model | udpipe model, can be loaded with udpipe::udpipe_load_model(). | 
| tagger | character - "tagger" parameter passed to udpipe::udpipe_annotate(). | 
| tokenizer | character - "tokenizer" parameter passed to udpipe::udpipe_annotate(). | 
| pos_keep | character vector of POS tags to keep; tokens with any other tag are filtered out. The default character(0) keeps all tags. | 
| pos_remove | character vector of POS tags to remove; tokens with these tags are filtered out. character(0) means remove nothing. Applied after pos_keep (see the sketch in the Examples below). | 
Value
A list of character vectors; each list element contains the tokens of the corresponding input string.
Examples
doc = c("first  second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - splits on a fixed single whitespace character
space_tokenizer(doc, " ")
space_tokenizer(doc, " ")