kanjivec {kanjistat}    R Documentation

Create kanjivec objects from kanjivg data

Description

Create a (list of) kanjivec object(s). Each object represents a kanji as a tree of strokes, based on the .svg files from the KanjiVG database, and contains further derived information.

Usage

kanjivec(
  kanji,
  database = NULL,
  flatten = "intelligent",
  bezier_discr = c("svgparser", "eqtimed", "eqspaced"),
  save = FALSE,
  overwrite = FALSE,
  simplify = TRUE
)

Arguments

kanji

a (vector of) character string(s) of one or several kanji.

database

the path to a local copy of (a subset of) the KanjiVG database. It is expected that the svg files reside at this exact location (not in a subdirectory). If NULL, an attempt is made to read the svg file(s) from the KanjiVG GitHub repository (after prompting for confirmation, which can be switched off via the option ask_github).

flatten

logical. Should nodes that are only-children be fused with their parents? Alternatively, one of the strings "intelligent", "inner" or "leaves". Although the first is the default, it is experimental and its precise meaning will change in the future; see Details.

bezier_discr

character. How to discretize the Bézier curves describing the strokes. If "svgparser" (the only option available prior to kanjistat 0.12.0), code from the non-CRAN package svgparser is used for discretizing at equal time steps. The new choices "eqtimed" and "eqspaced" discretize into fewer points (and allow for more customization underneath). The former creates discretization points at equal time steps, the latter at equal distance steps (to a good approximation).

save

logical or character. If FALSE, return the (list of) kanjivec object(s). Otherwise save the result as an rds file, either in the working directory (as kvecsave.rds) or under the file path provided.

overwrite

logical. If FALSE return an error (before any computations are done) if the designated file path already exists. Otherwise an existing file is overwritten.

simplify

logical. Shall a single kanjivec object be returned (instead of a list of one) if kanji is a single kanji?

Details

A kanjivec object contains detailed information on the strokes of which an individual kanji is composed including their order, a segmentation into reasonable components ("radicals" in a more general sense of the word), classification of individual strokes, and both vector data and interpolated points to recreate the actual stroke in a Kyoukashou style font. For more information on the original data see http://kanjivg.tagaini.net/. That data is licenced under Creative Commons BY-SA 3.0 (see licence file of this package).

The original .svg files sometimes contain additional <g> elements that provide information about the current group of strokes rather than establishing a new subgroup of their own. This typically happens for information that establishes coherence with another part of the tree (by noting that the current subgroup is also part 2 of something else), but also for variant information. With the option flatten = TRUE the extra hierarchy level in the tree is avoided, while the original information in the KanjiVG file is kept. This is achieved by fusing only-children to their parents, giving the new node the name of the child and all its attributes, but prefixing p. to the attribute names of the parent (the parent's "names" attribute is discarded, but can be reconstructed from the parent's id). Removal of several hierarchy levels in sequence can lead to attribute names with multiple p. in front. Fusing to parents is suppressed if the parent is the root of the hierarchy (typically for one-stroke kanji), as this could lead to confusing results.

The options flatten = "inner" and flatten = "leaves" apply the above behavior only to the corresponding type of node (inner nodes or leaves). The option flatten = "intelligent" tries to determine in more sophisticated ways which flattening is desirable and which is not (it flattens rather conservatively). Currently two kinds of nodes are flattened away. The first kind are nodes without an element attribute that have only one child (one example where this is reasonable is in kanji kbase[187, ]). The second kind are nodes with an element attribute and only one child, if this child is also an inner node and has the same element and part attribute as the parent, but both have no number (this would be problematic for any component-building code in the particular case of kanji kbase[1111, ]).
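The fusing of only-children described for flatten = TRUE can be illustrated with a minimal base R sketch. It is not the code used by kanjivec; the helpers node, is_leaf and fuse_only_children, as well as the toy tree with its ids and element names, are hypothetical and only mimic the idea of a nested list with attributes.

```r
# Hypothetical helper: attach attributes to a list of children.
node <- function(children, ...) {
  attributes(children) <- c(attributes(children), list(...))
  children
}
# In this toy model a leaf is a list holding one matrix of stroke points.
is_leaf <- function(x) !is.list(x[[1]])

# Fuse only-children to their parents, working from the top down.
# Parent attributes are kept on the child with a "p." prefix;
# the root is never fused away.
fuse_only_children <- function(x, is_root = TRUE) {
  while (!is_root && !is_leaf(x) && length(x) == 1L) {
    parent_attrs <- attributes(x)
    parent_attrs$names <- NULL            # the "names" attribute is discarded
    if (length(parent_attrs) > 0) {
      names(parent_attrs) <- paste0("p.", names(parent_attrs))
    }
    child <- x[[1]]
    attributes(child) <- c(attributes(child), parent_attrs)
    x <- child
  }
  if (!is_leaf(x)) x[] <- lapply(x, fuse_only_children, is_root = FALSE)
  x
}

stroke <- function(id) node(list(matrix(0, 1, 2)), id = id, type = "stroke")

# Toy tree: g1 has the only child g2; g3 > g4 > s3 is a chain of only-children.
tree <- node(list(
  node(list(node(list(stroke("s1"), stroke("s2")), id = "g2", element = "B")),
       id = "g1", element = "A"),
  node(list(node(list(stroke("s3")), id = "g4", element = "D")),
       id = "g3", element = "C")
), id = "root")

flat <- fuse_only_children(tree)
attr(flat[[1]], "id")      # id of the child that survived the fusion
attr(flat[[1]], "p.id")    # id of the parent that was fused away
names(attributes(flat[[2]]))
```

In the second branch two levels are removed in sequence, so the surviving leaf carries attributes with both a single and a double p. prefix.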

A kanjivec object has components

char

the kanji (a single character)

hex

its Unicode codepoint (integer of class hexmode)

padhex

the Unicode codepoint padded with zeros to five digits (mode character)

family

the font on which the data is based. Currently only "schoolbook" (to be extended with "kaisho" at some point)

nstrokes

the number of strokes in the kanji

ncompos

a vector of the number of components at each depth of the tree

nveins

the number of veins in the component structure

strokedend

the decomposition tree of the kanji as an object of class dendrogram

components

the component structure by segmentation depth (components can overlap) in terms of KanjiVG elements and their depth-first tree coordinates

veins

the veins in the component structure. Each vein is represented as a two-column matrix that lists in its rows the indices of components (starting at the root, which in the component indexing is c(1,1))

stroketree

the decomposition tree of the kanji, a list containing the full information of the KanjiVG file (except some top level attributes)

stroketree is a close representation of the KanjiVG svg file as a list object with some serious nesting of sublists. The XML attributes become attributes of the list and its elements. The user will usually not have to look at or manipulate stroketree directly, but strokedend and components are derived from it and other functions may process it further.

The main differences to the svg file are

  1. The actual strokes are not only given as d-attributes describing Bézier curves, but also as two-column matrices describing discretizations of these curves. These matrices are the actual contents of the innermost lists in stroketree, but are more conveniently accessed via the function get_strokes. Starting with version 0.13.0, there is also an additional attribute "beziermat", which describes the Bézier curves for the stroke in a 2 x (1+3n) matrix format. The first column is the start point, then each triplet of columns stands for control point 1, control point 2 and end point (=start point of the next Bézier curve if any).

  2. The positions of the stroke numbers (for plotting) are saved as an attribute strokenum_coords to the entire stroke tree rather than a separate element.
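The 2 x (1+3n) layout of "beziermat" and the two discretization ideas (equal time steps versus approximately equal distance steps) can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the matrix bez, the functions discretize_eqtimed and resample_eqspaced, and the numbers of points are made up, and the equal-distance version simply resamples a fine polyline by arc length.

```r
# Hypothetical stroke of n = 2 cubic Bézier curves in 2 x (1 + 3n) layout:
# start point, then (control 1, control 2, end point) for each curve.
bez <- matrix(c(0, 0,   1, 2,   3, 2,   4, 0,   5, -2,   7, -2,   8, 0),
              nrow = 2)

# Discretize every cubic curve at equally spaced parameter values.
discretize_eqtimed <- function(bez, npts = 5) {
  n <- (ncol(bez) - 1) / 3
  tt <- seq(0, 1, length.out = npts)
  # Bernstein basis for cubic curves, one row per parameter value
  B <- cbind((1 - tt)^3, 3 * (1 - tt)^2 * tt, 3 * (1 - tt) * tt^2, tt^3)
  segs <- lapply(seq_len(n), function(i) {
    p <- bez[, 3 * (i - 1) + 1:4, drop = FALSE]   # start, cp1, cp2, end
    pts <- B %*% t(p)                             # npts x 2 matrix
    if (i > 1) pts <- pts[-1, , drop = FALSE]     # drop the shared join point
    pts
  })
  do.call(rbind, segs)
}

# Resample a polyline at (approximately) equal distance steps
# by linear interpolation along its cumulative length.
resample_eqspaced <- function(pts, npts) {
  s <- c(0, cumsum(sqrt(rowSums(diff(pts)^2))))
  target <- seq(0, s[length(s)], length.out = npts)
  cbind(approx(s, pts[, 1], xout = target)$y,
        approx(s, pts[, 2], xout = target)$y)
}

pts_time  <- discretize_eqtimed(bez, npts = 5)
pts_space <- resample_eqspaced(discretize_eqtimed(bez, npts = 50), npts = 9)
dim(pts_time)
dim(pts_space)
```

Both results are two-column matrices of points along the stroke, which is the same shape as the discretizations stored in stroketree.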

strokedend is easier to examine and work with due to various convenience functions for dendrograms in the packages stats and dendextend, including str and plot.dendrogram. The function plot.kanjivec with option type = "dend" is a wrapper for plot.dendrogram with reasonable presets for various options.

The label-attributes of the nodes of strokedend are taken from the element (for inner nodes) and type (for leaves) attributes of the .svg files. They consist of UTF-8 characters representing kanji parts and a combination of UTF-8 characters representing strokes, and may not render well in all CJK fonts (see details of plot.kanjivec). If element and type are missing in the .svg file, the label assigned is the second part of the id-attribute, e.g. g5 or s9.

The components at a given level can be plotted, see plot.kanjivec with type = "kanji". Both components and veins serve mainly for the computation of kanji distances.
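The relation between the components char, hex and padhex listed above can be reproduced with base R alone. The sketch below is one possible way to compute them and is not taken from the package source; the variable names are chosen for illustration. KanjiVG names its .svg files by the zero-padded code point, which is why the padded form is convenient.

```r
kanji <- "\u85e4"

cp <- utf8ToInt(kanji)              # Unicode code point as an integer
hex <- as.hexmode(cp)               # integer of class "hexmode"
padhex <- format(hex, width = 5)    # character, zero-padded to five digits
svg_name <- paste0(padhex, ".svg")  # file name as used in KanjiVG

class(hex)
padhex     # "085e4"
svg_name   # "085e4.svg"
```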

Value

A list of objects of class kanjivec or, if only one kanji was specified and simplify is TRUE, a single object of class kanjivec. If save = TRUE, the same is (saved and) still returned invisibly.

See Also

plot.kanjivec, str.kanjivec

Examples

if (interactive()) {
  # Try to load the svg file for the kanji from GitHub.
  res <- kanjivec("\u85e4", database=NULL)
  str(res)
}

fivebetas  # sample kanjivec data
str(fivebetas[[1]])
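
A further sketch of how the components described in Details might be inspected, using the sample data fivebetas. It assumes that kanjistat is installed and that get_strokes accepts a kanjivec object as its first argument; the variable names kvec and strokes are chosen for illustration.

```r
library(kanjistat)

kvec <- fivebetas[[1]]
kvec$char       # the kanji itself
kvec$padhex     # zero-padded Unicode code point
kvec$nstrokes   # number of strokes
kvec$ncompos    # number of components at each depth of the tree

# Discretized strokes (assumed call, see the lead-in)
strokes <- get_strokes(kvec)
str(strokes, max.level = 1)

if (interactive()) {
  # Dendrogram view of the stroke tree
  plot(kvec, type = "dend")
}
```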


[Package kanjistat version 0.14.1 Index]