spacy_tokenize {spacyr} | R Documentation
Tokenize text with spaCy
Description
Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
Usage
spacy_tokenize(
x,
what = c("word", "sentence"),
remove_punct = FALSE,
remove_url = FALSE,
remove_numbers = FALSE,
remove_separators = TRUE,
remove_symbols = FALSE,
padding = FALSE,
multithread = TRUE,
output = c("list", "data.frame"),
...
)
Arguments
x |
a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif) |
what |
the unit for splitting the text; available alternatives are "word" (word segmentation) and "sentence" (sentence segmentation) |
remove_punct |
remove punctuation tokens. |
remove_url |
remove tokens that look like a url or email address. |
remove_numbers |
remove tokens that look like a number (e.g. "334", "3.1415", "fifty"). |
remove_separators |
remove spaces as separators; this applies only when all other remove functionalities (e.g. remove_punct) are set to FALSE. When what = "sentence", this option removes trailing spaces if TRUE |
remove_symbols |
remove symbols; these are tokens tagged SYM in the part-of-speech field, as well as currency symbols |
padding |
if TRUE, leave an empty string where removed tokens previously existed; this is useful if a positional match is needed between the pre- and post-selection tokens (see the example following this table) |
multithread |
logical; if TRUE, texts are processed in parallel using spaCy's pipe functionality |
output |
type of object to return, either "list" or "data.frame" |
... |
not used directly |
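As a rough illustration of how padding interacts with the remove_* options (a sketch assuming spaCy has already been set up with spacy_initialize(); the results shown in comments are indicative, not verbatim output):

spacy_tokenize("Hello, world!", remove_punct = TRUE, padding = TRUE)
# with padding = TRUE, removed punctuation tokens are replaced by empty
# strings so token positions are preserved, e.g. "Hello" "" "world" ""
spacy_tokenize("Hello, world!", remove_punct = TRUE, padding = FALSE)
# without padding, the punctuation tokens are simply dropped: "Hello" "world"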
Value
either a list of tokens or a data.frame of tokens, depending on the value of output
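A minimal sketch of the two return shapes (assuming spacy_initialize() has been called; the comments describe the expected structure rather than exact column names):

toks_list <- spacy_tokenize("A tiny example.", output = "list")
# named list of character vectors, one element per document
toks_df <- spacy_tokenize("A tiny example.", output = "data.frame")
# long-format data.frame of tokens, one row per token, keyed by document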
Examples
## Not run:
spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)
txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
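# further illustrative calls using only the documented arguments:
# sentence segmentation, and word tokenization returned as a data.frame
spacy_tokenize(txt2, what = "sentence")
spacy_tokenize(txt2, remove_punct = TRUE, output = "data.frame")

# shut down the background spaCy process when finished
spacy_finalize()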
## End(Not run)