tokenize {PsychWordVec}        R Documentation
Tokenize raw text for training word embeddings.
Description
Tokenize raw text for training word embeddings.
Usage
tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)
Arguments
text: A character vector of text, or a file path on disk containing text.

tokenizer: Function used to tokenize the text. Defaults to text2vec::word_tokenizer. A custom tokenizer function may also be supplied (see the sketch after this list).

split: Separator between tokens, only used when simplify=TRUE. Defaults to " ".

remove: Strings (in regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.".

encoding: Text encoding (only used if text is a file path). Defaults to "UTF-8".

simplify: Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose: Print information to the console? Defaults to TRUE.
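
A minimal sketch of non-default arguments (illustrative; space_tokenizer below is a hypothetical stand-in, not part of the package). It assumes tokenize() accepts any function mapping a character vector to a list of token vectors, as text2vec's tokenizers do, and that remove=NULL disables the default removal pattern.

# Hypothetical whitespace-only tokenizer (assumption, not package code)
space_tokenizer = function(strings) strsplit(strings, "\\s+")
tokenize("State-of-the-art NLP is fun.", tokenizer=space_tokenizer)
# Assumption: remove=NULL keeps underscores and other default-removed strings
tokenize("word_embedding models", remove=NULL)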
Value
simplify=TRUE: A tokenized character vector, with each element as a sentence.

simplify=FALSE: A list of tokenized character vectors, with each element as a vector of tokens in a sentence.
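
To illustrate the two return shapes (a sketch; the exact tokens depend on the tokenizer):

tk_vec  = tokenize("Word vectors are fun. They encode meaning.")                  # simplify=TRUE (default)
tk_list = tokenize("Word vectors are fun. They encode meaning.", simplify=FALSE)
str(tk_vec)   # character vector: tokens of each sentence joined by split
str(tk_list)  # list: one character vector of tokens per sentence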
See Also
Examples
library(PsychWordVec)

txt1 = c(
"I love natural language processing (NLP)!",
"I've been in this city for 10 years. I really like here!",
"However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)  # a list of token vectors, one per sentence
tokenize(txt1) %>% cat(sep="\n----\n")  # tokenized sentences, printed with separators
txt2 = text2vec::movie_review$review[1:5]  # five raw movie reviews
texts = tokenize(txt2)  # tokenized sentences from all five reviews
txt2[1]
texts[1:20] # all sentences in txt2[1]
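
# Since text may also be a file path, a sketch of tokenizing a corpus from
# disk (assuming tokenize() detects an existing path, as the text argument
# describes; the file here is a temporary one created only for illustration):
tmp = tempfile(fileext=".txt")
writeLines(c("First sentence here.", "Second sentence here."), tmp)
tokenize(tmp, encoding="UTF-8")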