read_kanjidic2 {kanjistat} | R Documentation |
Read a KANJIDIC2 file
Description
Perform basic validity checks and transform data to a standardized list or keep as an object of class xml_document (package xml2).
Usage
read_kanjidic2(fpath = NULL, output = c("list", "xml"))
Arguments
fpath |
the path to a local KANJIDIC2 file. If |
output |
one of "list" or "xml". |
Details
KANJIDIC2 contains detailed information on all of the 13108 kanji in three main Japanese standards (JIS X 0208, 0212 and 0213). The KANJIDIC files have been compiled and maintained by Jim Breen since 1991, with the help of various other people. The copyright is now held by the Electronic Dictionary Research and Development Group (EDRDG). The files are made available under the Creative Commons BY-SA 4.0 license. See https://www.edrdg.org/wiki/index.php/KANJIDIC_Project for details on the contents of the files and their license.
If output = "xml", some minimal checks are performed (high level structure and total number of kanji).
If output = "list", additional validity checks of the lower level structure are performed. Most are in accordance with the file's Document Type Definition (DTD). Some additional checks concern common patterns that hold for the current KANJIDIC2 file (as of December 2023) and seem unlikely to change in the near future. This includes that there is always at most one rmgroup entry in reading_meaning. Informative warnings are provided if any of these additional checks fail.
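The kind of check described above can be sketched with xml2 as follows. This is only an illustration on a tiny hand-written fragment, not the package's actual implementation: the XML string is a simplified stand-in for a real KANJIDIC2 file, and the names toy, chars, n_kanji and offenders are hypothetical.

```r
library(xml2)

# Hand-written toy fragment mimicking the KANJIDIC2 layout (not real file content).
# The second entry deliberately contains two rmgroup elements.
toy <- read_xml(paste0(
  "<kanjidic2>",
  "<character><literal>\u65e5</literal>",
  "<reading_meaning><rmgroup><meaning>day</meaning></rmgroup></reading_meaning>",
  "</character>",
  "<character><literal>\u6708</literal>",
  "<reading_meaning>",
  "<rmgroup><meaning>month</meaning></rmgroup>",
  "<rmgroup><meaning>moon</meaning></rmgroup>",
  "</reading_meaning>",
  "</character>",
  "</kanjidic2>"
))

# High-level checks: name of the root element and number of kanji entries.
stopifnot(xml_name(toy) == "kanjidic2")
chars <- xml_find_all(toy, "/kanjidic2/character")
n_kanji <- length(chars)

# Lower-level check: entries with more than one rmgroup inside reading_meaning.
offenders <- xml_find_all(
  toy, "/kanjidic2/character[count(reading_meaning/rmgroup) > 1]"
)
if (length(offenders) > 0) {
  bad_literals <- xml_text(xml_find_first(offenders, "./literal"))
  warning("More than one rmgroup found for: ",
          paste(bad_literals, collapse = ", "), call. = FALSE)
}
```

On a real file one would expect n_kanji to match the documented total and offenders to be empty; here the toy fragment triggers the warning on purpose.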
Value
If output = "xml", the exact XML document obtained from xml2::read_xml. If output = "list", a list of lists (the individual kanji), each with the following seven components.
- literal: a single UTF-8 character representing the kanji.
- codepoint: a named character vector giving the available codepoints in the unicode and jis standards.
- radical: a named numeric vector giving the radical number(s), in the range 1 to 214. The number named classical is as recorded in the KangXi Zidian (1716); if there is a number named nelson_c, the kanji was reclassified in Nelson's Modern Reader's Japanese-English Character Dictionary (1962/74).
- misc: a list with six components
  - grade: the kanji grade level. 1 through 6 indicates a kyouiku kanji and the grade in which the kanji is taught in Japanese primary school. 8 indicates one of the remaining jouyou kanji learned in junior high school, and 9 or 10 are jinmeiyou kanji. The remaining (hyougai) kanji have NA as their entry.
  - stroke_count: The stroke count of the kanji, including the radical. If more than one, the first is considered the accepted count, while subsequent ones are common miscounts.
  - variant: a named character vector giving either a cross-reference code to another kanji, usually regarded as a variant, or an alternative indexing code for the current kanji. The type of variant is given in the name.
  - freq: the frequency rank (1 = most frequent) based on newspaper data. NA if not among the 2500 most frequent.
  - rad_name: a character vector. For a kanji that is a radical itself, the name(s) of the radical (if there are any), otherwise of length 0.
  - jlpt: The Japanese Language Proficiency Test level according to the old four-level system that was in place before 2010. A value from 4 (most elementary) to 1 (most advanced).
- dic_number: a named character vector (possibly of length 0) giving the index numbers (for some kanji with letters attached) of the kanji in various dictionaries, textbooks and flashcard collections (specified by the name). For Morohashi's Dai Kan-Wa Jiten, the volume and page number is also provided in the format moro.VOL.PAGE.
- query_code: a named character vector giving the codes of the kanji in various query systems (specified by the name). For Halpern's SKIP code, possible misclassifications (if any) of the kanji are also noted in the format mis.skip.TYPE, where TYPE indicates the type of misclassification.
- reading_meaning: a (possibly empty) list containing zero or more rmgroup components creating groups of readings and meanings (in practice there is never more than one rmgroup currently) as well as a component nanori giving a character vector (possibly of length 0) of readings only associated with names. Each rmgroup is a list with entries:
  - reading: a (possibly empty) list of entries named from among pinyin, korean_r, korean_h, vietnam, ja_on and ja_kun, each containing a character vector of the corresponding readings.
  - meaning: a (possibly empty) list of entries named with two-letter (ISO 639-1) language codes, each containing a character vector of the corresponding meanings.
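To make the list structure above more concrete, here is a minimal base R sketch that builds a few entries by hand in the documented shape and works with the grade component. Everything in it is an assumption for illustration: toy_kanji and grade_label are hypothetical helpers (not part of kanjistat), the numeric values are hand-typed rather than read from KANJIDIC2, and the name "ucs" used for the codepoint is a guess at the naming.

```r
# Hypothetical constructor: one entry with the seven documented components.
# Components not needed for the illustration are left empty.
toy_kanji <- function(literal, grade, stroke_count, freq) {
  list(
    literal = literal,
    codepoint = c(ucs = sprintf("%x", utf8ToInt(literal))),  # name "ucs" is assumed
    radical = numeric(0),
    misc = list(
      grade = grade,
      stroke_count = stroke_count,
      variant = character(0),
      freq = freq,
      rad_name = character(0),
      jlpt = NA_real_
    ),
    dic_number = character(0),
    query_code = character(0),
    reading_meaning = list()  # possibly empty, as documented
  )
}

# Illustrative values only; they are not guaranteed to match the actual file.
kanji_list <- list(
  toy_kanji("\u672c", grade = 1,  stroke_count = 5,  freq = 10),
  toy_kanji("\u9b31", grade = 8,  stroke_count = 29, freq = NA),
  toy_kanji("\u5618", grade = NA, stroke_count = 14, freq = NA)
)

# Translate the grade codes described above into category labels.
grade_label <- function(grade) {
  if (is.na(grade)) {
    "hyougai"
  } else if (grade %in% 1:6) {
    "kyouiku"
  } else if (grade == 8) {
    "jouyou"
  } else if (grade %in% 9:10) {
    "jinmeiyou"
  } else {
    NA_character_
  }
}

literals <- vapply(kanji_list, function(k) k$literal, character(1))
labels <- vapply(kanji_list, function(k) grade_label(k$misc$grade), character(1))

# Keep only kanji taught in primary school (grades 1 to 6).
kyouiku <- Filter(
  function(k) !is.na(k$misc$grade) && k$misc$grade <= 6,
  kanji_list
)
```

The same vapply and Filter pattern should carry over to the list returned with output = "list", since each element is documented to have these components.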
See Also
Examples
if (interactive()) {
read_kanjidic2("kanjidic2.xml")
}