tokenize {PsychWordVec}    R Documentation

Tokenize raw text for training word embeddings.

Description

Tokenize raw text for training word embeddings.

Usage

tokenize(
  text,
  tokenizer = text2vec::word_tokenizer,
  split = " ",
  remove = "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.",
  encoding = "UTF-8",
  simplify = TRUE,
  verbose = TRUE
)

Arguments

text

A character vector of text, or a file path on disk containing text.

tokenizer

Function used to tokenize the text. Defaults to text2vec::word_tokenizer. A sketch of swapping in another tokenizer follows the argument list.

split

Separator inserted between tokens when they are pasted back into one string per text; used only when simplify=TRUE. Defaults to " ".

remove

Strings (as a regular expression) to be removed from the text. Defaults to "_|'|<br/>|<br />|e\\.g\\.|i\\.e\\.". You can turn this off by setting remove=NULL.

encoding

Text encoding (used only if text is a file path). Defaults to "UTF-8". See the file-reading example under Examples.

simplify

Return a character vector (TRUE) or a list of character vectors (FALSE). Defaults to TRUE.

verbose

Print information to the console? Defaults to TRUE.
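
A minimal sketch of how these arguments combine, assuming only that PsychWordVec and text2vec are installed (text2vec::space_tokenizer is another tokenizer shipped with text2vec; the input strings are made up for illustration):

library(PsychWordVec)
txt = c("I_love_snake_case, e.g. this!",
        "Tokens' apostrophes are removed by default.")
tokenize(txt, verbose=FALSE)  # default: "_", "'", "e.g.", etc. are stripped first
tokenize(txt,
         tokenizer = text2vec::space_tokenizer,  # split on spaces only
         remove = NULL,                          # keep strings the default pattern removes
         split = " | ",                          # joiner between tokens when simplify=TRUE
         verbose = FALSE)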

Value

A character vector of tokenized texts (if simplify=TRUE) or a list of character vectors of tokens (if simplify=FALSE).

See Also

train_wordvec

Examples

txt1 = c(
  "I love natural language processing (NLP)!",
  "I've been in this city for 10 years. I really like here!",
  "However, my computer is not among the \"Top 10\" list."
)
tokenize(txt1, simplify=FALSE)  # a list of character vectors of tokens
tokenize(txt1) %>% cat(sep="\n----\n")  # a character vector of tokenized strings

txt2 = text2vec::movie_review$review[1:5]  # five raw movie reviews
texts = tokenize(txt2)

txt2[1]
texts[1:20]  # all sentences in txt2[1]
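
# text may also be a file path on disk, in which case the file is read
# (with the given encoding) before tokenization. A short sketch, writing
# a made-up file to a temporary directory:
f = file.path(tempdir(), "corpus.txt")  # hypothetical file name, for illustration
writeLines(txt1, f)
tokenize(f, encoding="UTF-8")  # read from the file, then tokenize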


[Package PsychWordVec version 2023.9]