read_kanjidic2 {kanjistat}

R Documentation

Read a KANJIDIC2 file

Description

Perform basic validity checks and either transform the data to a standardized list or keep it as an object of class xml_document (package xml2).

Usage

read_kanjidic2(fpath = NULL, output = c("list", "xml"))

Arguments

`fpath`	the path to a local KANJIDIC2 file. If `NULL` (the default) the most recent KANJIDIC2 file is downloaded from https://www.edrdg.org/kanjidic/kanjidic2.xml.gz after asking for confirmation.
`output`	one of `"list"` or `"xml"`. The desired type of output.
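
The default `output = c("list", "xml")` follows the common R convention in which the first element of the vector is used when the argument is not supplied. The sketch below illustrates that convention with a hypothetical stand-in function `resolve_output`; it assumes the argument is resolved with `match.arg`, which this page does not confirm.

```r
# Hypothetical stand-in, not the kanjistat implementation:
# shows how a default such as c("list", "xml") is typically resolved.
resolve_output <- function(output = c("list", "xml")) {
  match.arg(output)
}

resolve_output()       # "list" (first element is the default)
resolve_output("xml")  # "xml"
```

Under this assumption, any value other than `"list"` or `"xml"` raises an error.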

Details

KANJIDIC2 contains detailed information on all of the 13108 kanji in three main Japanese standards (JIS X 0208, 0212 and 0213). The KANJIDIC files have been compiled and maintained by Jim Breen since 1991, with the help of various other people. The copyright is now held by the Electronic Dictionary Research and Development Group (EDRDG). The files are made available under the Creative Commons BY-SA 4.0 license. See https://www.edrdg.org/wiki/index.php/KANJIDIC_Project for details on the contents of the files and their license.

If output = "xml", some minimal checks are performed (high level structure and total number of kanji).

If output = "list", additional validity checks of the lower level structure are performed. Most are in accordance with the file's Document Type Definition (DTD). Some additional checks concern common patterns that hold for the current KANJIDIC2 file (as of December 2023) and seem unlikely to change in the near future, for example that there is at most one rmgroup entry in reading_meaning. Informative warnings are issued if any of these additional checks fail.
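
As a rough illustration of the kind of pattern check described above, the following sketch counts rmgroup elements per character with xml2 and warns if any character has more than one. It is not the code used by read_kanjidic2: the XML fragment is hand-written with placeholder literals (A and B) and only mimics the KANJIDIC2 element names, and the variable names are made up for this example.

```r
library(xml2)

# Hand-written fragment mimicking KANJIDIC2 element names (placeholder data,
# not taken from the real file). Character "B" deliberately has two rmgroups.
doc <- read_xml('<kanjidic2>
  <character>
    <literal>A</literal>
    <reading_meaning>
      <rmgroup><meaning>first</meaning></rmgroup>
    </reading_meaning>
  </character>
  <character>
    <literal>B</literal>
    <reading_meaning>
      <rmgroup><meaning>x</meaning></rmgroup>
      <rmgroup><meaning>y</meaning></rmgroup>
    </reading_meaning>
  </character>
</kanjidic2>')

chars <- xml_find_all(doc, "/kanjidic2/character")

# Number of rmgroup elements below reading_meaning, one count per character
n_rmgroup <- vapply(seq_along(chars), function(i) {
  length(xml_find_all(chars[[i]], "./reading_meaning/rmgroup"))
}, integer(1))

# Literals of the characters that violate the "at most one rmgroup" pattern
bad <- xml_text(xml_find_first(chars[n_rmgroup > 1], "./literal"))
if (length(bad) > 0) {
  warning("More than one rmgroup found for: ", paste(bad, collapse = ", "))
}
```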

Value

If output = "xml", the exact XML document obtained from xml2::read_xml. If output = "list", a list of lists (one per kanji), each with the following seven components:

literal: a single UTF-8 character representing the kanji.
codepoint: a named character vector giving the available codepoints in the Unicode and JIS standards.
radical: a named numeric vector giving the radical number(s), in the range 1 to 214. The number named classical is as recorded in the KangXi Zidian (1716); if there is a number named nelson_c, the kanji was reclassified in Nelson's Modern Reader's Japanese-English Character Dictionary (1962/74).
misc: a list with six components:
- grade: the kanji grade level. 1 through 6 indicates a kyouiku kanji and the grade in which the kanji is taught in Japanese primary school. 8 indicates one of the remaining jouyou kanji learned in junior high school, and 9 or 10 indicates a jinmeiyou kanji. The remaining (hyougai) kanji have NA as their entry.
- stroke_count: the stroke count of the kanji, including the radical. If there is more than one, the first is considered the accepted count, while subsequent ones are common miscounts.
- variant: a named character vector giving either a cross-reference code to another kanji, usually regarded as a variant, or an alternative indexing code for the current kanji. The type of variant is given in the name.
- freq: the frequency rank (1 = most frequent) based on newspaper data. NA if not among the 2500 most frequent.
- rad_name: a character vector. For a kanji that is a radical itself, the name(s) of the radical (if there are any), otherwise of length 0.
- jlpt: the Japanese Language Proficiency Test level according to the old four-level system that was in place before 2010. A value from 4 (most elementary) to 1 (most advanced).
dic_number: a named character vector (possibly of length 0) giving the index numbers (for some kanji with letters attached) of the kanji in various dictionaries, textbooks and flashcard collections (specified by the name). For Morohashi's Dai Kan-Wa Jiten, the volume and page number are also provided in the format moro.VOL.PAGE.
query_code: a named character vector giving the codes of the kanji in various query systems (specified by the name). For Halpern's SKIP code, possible misclassifications (if any) of the kanji are also noted in the format mis.skip.TYPE, where TYPE indicates the type of misclassification.
reading_meaning: a (possibly empty) list containing zero or more rmgroup components, each grouping readings and meanings (currently there is never more than one rmgroup in practice), as well as a component nanori giving a character vector (possibly of length 0) of readings only associated with names. Each rmgroup is a list with entries:
- reading: a (possibly empty) list of entries named from among pinyin, korean_r, korean_h, vietnam, ja_on and ja_kun, each containing a character vector of the corresponding readings.
- meaning: a (possibly empty) list of entries named with two-letter (ISO 639-1) language codes, each containing a character vector of the corresponding meanings.
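
To make the nesting above concrete, here is a minimal sketch that builds one hand-written mock entry with the seven components and shows how fields might be accessed. The values are typed in for illustration only and are not read from a KANJIDIC2 file; the codepoint name "ucs" and the variable names `entry` and `kanji_list` are assumptions made for this example.

```r
# Hand-written mock of a single entry; values are illustrative only.
entry <- list(
  literal = "\u65e5",
  codepoint = c(ucs = "65e5"),        # the name "ucs" is an assumption
  radical = c(classical = 72),
  misc = list(
    grade = 1,
    stroke_count = 4,
    variant = character(0),
    freq = 1,
    rad_name = character(0),
    jlpt = 4
  ),
  dic_number = character(0),
  query_code = character(0),
  reading_meaning = list(
    rmgroup = list(
      reading = list(ja_on = "\u30cb\u30c1"),
      meaning = list(en = c("day", "sun"))
    ),
    nanori = character(0)
  )
)

# The list output is a list of such entries (here of length 1)
kanji_list <- list(entry)

# Extract one field across all entries
literals <- vapply(kanji_list, function(k) k$literal, character(1))
grades <- vapply(kanji_list, function(k) as.numeric(k$misc$grade), numeric(1))

# Keep only kyouiku kanji (grades 1 through 6); NA grades are dropped
kyouiku <- kanji_list[!is.na(grades) & grades >= 1 & grades <= 6]

# English meanings of the single rmgroup of the first entry
en_meanings <- kanji_list[[1]]$reading_meaning$rmgroup$meaning$en
```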

Examples

if (interactive()) {
  read_kanjidic2("kanjidic2.xml")
}

[Package kanjistat version 0.14.1 Index]