build_user_dic {gibasa} | R Documentation |
Build user dictionary
Description
Builds a UTF-8 user dictionary from a csv file.
Usage
build_user_dic(dic_dir, file, csv_file, encoding)
Arguments
dic_dir | Directory where the source dictionaries are located. Passed as the argument of the '-d' option.
file | Path to write the user dictionary to. Passed as the argument of the '-u' option.
csv_file | Path to the input csv file.
encoding | Encoding of the input csv file. Passed as the argument of the '-f' option.
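As a rough illustration of how these arguments line up with the options named above, the sketch below assembles an option vector in base R. It is not the package's internal code: the paths are hypothetical placeholders, and the real call may pass further options that this page does not list.

```r
# Hypothetical inputs; replace with real paths when trying this out.
dic_dir  <- "/path/to/dic"
file     <- "user.dic"
csv_file <- "words.csv"
encoding <- "utf8"

# Pair each argument with the option it is documented to be passed as.
# The csv file is given last, without an option flag (an assumption here).
args <- c(
  "-d", dic_dir,   # source dictionary directory
  "-u", file,      # output user dictionary
  "-f", encoding,  # encoding of the input csv
  csv_file
)
print(args)
```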
Details
This function is a wrapper around the dictionary compiler of 'MeCab'.
Note that it does not support automatic assignment of the word cost field,
so no word cost may be left empty in the input csv file.
To estimate word costs, use the posDebugRcpp() function.
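Because empty word costs are not filled in automatically, it can help to check the input before compiling. The following is a minimal sketch in base R, assuming the field order used in the example on this page (surface, left id, right id, word cost, features) and no quoted commas in the first four fields. The helper name find_empty_costs is hypothetical and is not part of gibasa.

```r
# Hypothetical helper: return the row numbers whose word cost
# (assumed to be the 4th comma-separated field) is missing or blank.
find_empty_costs <- function(lines) {
  fields <- strsplit(lines, ",", fixed = TRUE)
  cost <- vapply(
    fields,
    function(x) if (length(x) >= 4) trimws(x[4]) else "",
    character(1)
  )
  which(!nzchar(cost))
}

entries <- c(
  "qa, 0, 0, 5, \u304f\u3041",
  "qi, 0, 0, , \u304f\u3043" # word cost left empty
)
find_empty_costs(entries)
```

Rows reported by such a check would need a cost filled in before the file is handed to build_user_dic().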
Value
TRUE is returned invisibly if the dictionary is built successfully.
Examples
if (requireNamespace("withr")) {
  # create a sample dictionary in temporary directory
  build_sys_dic(
    dic_dir = system.file("latin", package = "gibasa"),
    out_dir = tempdir(),
    encoding = "utf8"
  )
  # copy the 'dicrc' file
  file.copy(
    system.file("latin/dicrc", package = "gibasa"),
    tempdir()
  )
  # write a csv file and compile it into a user dictionary
  csv_file <- tempfile(fileext = ".csv")
  writeLines(
    c(
      "qa, 0, 0, 5, \u304f\u3041",
      "qi, 0, 0, 5, \u304f\u3043",
      "qu, 0, 0, 5, \u304f",
      "qe, 0, 0, 5, \u304f\u3047",
      "qo, 0, 0, 5, \u304f\u3049"
    ),
    csv_file
  )
  build_user_dic(
    dic_dir = tempdir(),
    file = (user_dic <- tempfile(fileext = ".dic")),
    csv_file = csv_file,
    encoding = "utf8"
  )
  # mocking a 'mecabrc' file to temporarily use the dictionary
  withr::with_envvar(
    c(
      "MECABRC" = if (.Platform$OS.type == "windows") {
        "nul"
      } else {
        "/dev/null"
      },
      "RCPP_PARALLEL_BACKEND" = "tinythread"
    ),
    {
      tokenize("quensan", sys_dic = tempdir(), user_dic = user_dic)
    }
  )
}
[Package gibasa version 1.1.1 Index]