tokenize {gibasa} | R Documentation
Tokenize sentences using 'MeCab'
Description
Tokenize sentences using 'MeCab'
Usage
tokenize(
  x,
  text_field = "text",
  docid_field = "doc_id",
  sys_dic = "",
  user_dic = "",
  split = FALSE,
  partial = FALSE,
  grain_size = 1L,
  mode = c("parse", "wakati")
)
Arguments
x: A data.frame like object or a character vector to be tokenized.

text_field: String or symbol; the column of x containing texts to be tokenized.

docid_field: String or symbol; the column of x containing document IDs.

sys_dic: Character scalar; path to the system dictionary for 'MeCab'. Note that the system dictionary is expected to be compiled with UTF-8, not Shift-JIS or other encodings.

user_dic: Character scalar; path to the user dictionary for 'MeCab'.

split: Logical. When passed as TRUE, the function internally splits each text into sentences before tokenizing them.

partial: Logical. When passed as TRUE, activates the partial parsing mode of 'MeCab'.

grain_size: Integer value greater than or equal to 1. This argument is internally passed to RcppParallel::parallelFor; setting a larger grain size can improve performance in some cases.

mode: Character scalar to switch output format; either "parse" or "wakati".
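For instance, the field arguments accept column names as plain strings, and the dictionary arguments are paths handed directly to 'MeCab'. A minimal sketch (the user dictionary path is a hypothetical placeholder; ginga is the sample corpus bundled with gibasa):

## Not run:
# Point the tokenizer at non-default column names
dat <- data.frame(id = 1:2, body = ginga[5:6])
tokenize(dat, text_field = "body", docid_field = "id")

# Pass a user dictionary through to 'MeCab'
# (hypothetical path; replace with a real compiled .dic file)
tokenize(dat, text_field = "body", docid_field = "id",
         user_dic = "/path/to/user.dic")
## End(Not run)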
Value
A tibble when mode = "parse" (the default), or a named list of tokens when mode = "wakati".
Examples
## Not run:
df <- tokenize(
  data.frame(
    doc_id = seq_along(5:8),
    text = ginga[5:8]
  )
)
head(df)
## End(Not run)
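The remaining switches can be sketched in the same way, assuming a working 'MeCab' installation; per the Value section, "wakati" mode returns a named list rather than a tibble:

## Not run:
# Split each text into sentences before tokenizing
df_split <- tokenize(
  data.frame(doc_id = seq_along(5:8), text = ginga[5:8]),
  split = TRUE
)

# "wakati" mode: a named list of token vectors
toks <- tokenize(ginga[5:8], mode = "wakati")
str(toks)
## End(Not run)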