| kanjivec {kanjistat} | R Documentation |
Create kanjivec objects from kanjivg data
Description
Create a (list of) kanjivec object(s). Each object represents a kanji as a tree of strokes, based on .svg files from the KanjiVG database, and contains further derived information.
Usage
kanjivec(
kanji,
database = NULL,
flatten = "intelligent",
bezier_discr = c("svgparser", "eqtimed", "eqspaced"),
save = FALSE,
overwrite = FALSE,
simplify = TRUE
)
Arguments
kanji |
a (vector of) character string(s) of one or several kanji. |
database |
the path to a local copy of (a subset of) the KanjiVG database. It is expected
that the svg files reside at this exact location (not in a subdirectory). If |
flatten |
logical. Should nodes that are only-children be fused with their parents? Alternatively one of the strings "intelligent", "inner" or "leaves". Although the first is the default, it is experimental and its precise meaning will change in the future; see Details. |
bezier_discr |
character. How to discretize the Bézier curves describing the strokes. If "svgparser" (the only option available prior to kanjistat 0.12.0), code from the non-CRAN package svgparser is used for discretizing at equal time steps. The new choices "eqtimed" and "eqspaced" discretize into fewer points (and allow for more customization underneath). The former creates discretization points at equal time steps, the latter at equal distance steps (to a good approximation). |
save |
logical or character. If FALSE return the (list of) kanjivec object(s). Otherwise save the result as an rds file in the working directory (as kvecsave.rds) or under the file path provided. |
overwrite |
logical. If FALSE return an error (before any computations are done) if the designated file path already exists. Otherwise an existing file is overwritten. |
simplify |
logical. Should a single kanjivec object be returned (instead of a list of one) if |
Details
A kanjivec object contains detailed information on the strokes of which an individual kanji is composed including their order, a segmentation into reasonable components ("radicals" in a more general sense of the word), classification of individual strokes, and both vector data and interpolated points to recreate the actual stroke in a Kyoukashou style font. For more information on the original data see http://kanjivg.tagaini.net/. That data is licenced under Creative Commons BY-SA 3.0 (see licence file of this package).
The original .svg files sometimes contain additional <g> elements that provide
information about the current group of strokes rather than establishing a new subgroup
of its own. This happens typically for information that establishes coherence with another
part of the tree (by noting that the current subgroup is also part 2 of something else),
but also for variant information. With the option flatten = TRUE the extra hierarchy
level in the tree is avoided, while the original information in the KanjiVG file is kept.
This is achieved by fusing only-children to their parents, giving the new node the name
of the child and all its attributes, but prefixing p. to the attribute names
of the parent (the parent's "names" attribute is discarded, but can be reconstructed from
the parent's id). Removing several hierarchy levels in sequence can lead to attribute names
with multiple p. prefixes. Fusing to parents is suppressed if the parent is the
root of the hierarchy (typically for one-stroke kanji), as this could lead to confusing
results.
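The fusing rule described above can be illustrated on a toy tree in base R. This is only a sketch under stated assumptions, not the code used by kanjistat: the function fuse_only_children, the helper stroke and the toy node layout (nested lists carrying id, element and part attributes) are hypothetical and chosen purely for illustration.

```r
# Toy illustration of fusing only-children into their parents.
# NOT the kanjistat implementation; names and tree layout are assumptions.
# An inner node is a list of children, a leaf is any non-list object.
fuse_only_children <- function(node, is_root = TRUE) {
  if (!is.list(node)) return(node)               # a leaf: nothing to fuse
  # Fuse top-down so that repeated fusions accumulate "p.p." prefixes.
  while (!is_root && is.list(node) && length(node) == 1L) {
    child <- node[[1L]]
    parent_attr <- attributes(node)
    parent_attr$names <- NULL                    # the parent's names are discarded
    for (nm in names(parent_attr)) {
      attr(child, paste0("p.", nm)) <- parent_attr[[nm]]
    }
    node <- child                                # the child replaces its parent
  }
  if (!is.list(node)) return(node)
  node_attr <- attributes(node)
  node <- lapply(node, fuse_only_children, is_root = FALSE)
  attributes(node) <- node_attr                  # lapply drops custom attributes
  node
}

# Hypothetical leaf constructor for the toy tree.
stroke <- function(id) structure("stroke", id = id)

# g1 has g2 as its only child, so g1 is fused into g2.
toy <- structure(list(
  structure(list(
    structure(list(stroke("s1"), stroke("s2")), id = "g2", element = "A")
  ), id = "g1", part = "2"),
  stroke("s3")
), id = "root")

flat <- fuse_only_children(toy)
names(attributes(flat[[1]]))   # attributes of g2 plus the p.-prefixed ones of g1
```

In this sketch the root is never fused, which mirrors the suppression for one-stroke kanji, and a chain of two only-children would leave a p.p.id attribute on the surviving node.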
The options flatten = "inner" and flatten = "leaves" implement the above behavior
only for the corresponding type of node (inner nodes or leaves). The option
flatten = "intelligent" tries to find out in more sophisticated ways which flattening
is desirable and which is not (it will flatten rather conservatively). Currently nodes without
an element attribute that have only one child are flattened away (one example where this is
reasonable is in kanji kbase[187, ]), as are nodes with an element attribute and only
one child if this child is also an inner node and has the same element and part attribute as the
parent, but both have no number (this would be problematic for any component-building code
in the particular case of kanji kbase[1111, ]).
A kanjivec object has components
char: the kanji (a single character)
hex: its Unicode codepoint (integer of class hexmode)
padhex: the Unicode codepoint padded with zeros to five digits (mode character)
family: the font on which the data is based. Currently only "schoolbook" (to be extended with "kaisho" at some point)
nstrokes: the number of strokes in the kanji
ncompos: a vector of the number of components at each depth of the tree
nveins: the number of veins in the component structure
strokedend: the decomposition tree of the kanji as an object of class dendrogram
components: the component structure by segmentation depth (components can overlap) in terms of KanjiVG elements and their depth-first tree coordinates
veins: the veins in the component structure. Each vein is represented as a two-column matrix that lists in its rows the indices of components (starting at the root, which in the component indexing is c(1,1))
stroketree: the decomposition tree of the kanji, a list containing the full information of the KanjiVG file (except some top level attributes)
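As a small illustration of how the hex and padhex components relate to char, the following base R sketch derives both from a single kanji. It is an assumption that kanjistat computes them in a comparable way; the variable names are illustrative only.

```r
# Sketch in base R; not necessarily how kanjistat derives these components.
kanji <- "\u85e4"                  # the kanji used in the Examples section
code <- utf8ToInt(kanji)           # integer Unicode codepoint
hex <- as.hexmode(code)            # integer of class "hexmode"
padhex <- format(hex, width = 5)   # character, zero-padded to five digits

class(hex)
padhex
```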
stroketree is a close representation of the KanjiVG svg file as a deeply nested list.
The XML attributes become attributes of the list and its elements.
The user will usually not have to look at or manipulate stroketree directly, but
strokedend and components are derived from it and other functions may process it
further.
The main differences to the svg file are as follows.
First, the actual strokes are given not only as d-attributes describing Bézier curves, but also as two-column matrices describing discretizations of these curves. These matrices are the actual contents of the innermost lists in stroketree, but are more conveniently accessed via the function get_strokes. Starting with version 0.13.0, there is also an additional attribute "beziermat", which describes the Bézier curves for the stroke in a 2 x (1+3n) matrix format. The first column is the start point, then each triplet of columns stands for control point 1, control point 2 and end point (= start point of the next Bézier curve, if any).
Second, the positions of the stroke numbers (for plotting) are saved as an attribute strokenum_coords to the entire stroke tree rather than as a separate element.
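To make the 2 x (1+3n) matrix layout and the two discretization ideas more concrete, here is a minimal base R sketch. It is not the code used by kanjistat or svgparser: the matrix bez is made-up data, and the functions cubic_point, discretize_eqtimed and resample_eqspaced are hypothetical helpers. Equal spacing is approximated here by resampling the equal-time polyline along its cumulative length, which is one simple way to get roughly equidistant points.

```r
# Made-up stroke with n = 2 cubic Bezier segments in the 2 x (1 + 3n) layout.
bez <- cbind(c(0, 0),                              # start point
             c(10, 0), c(20, 10), c(30, 10),       # cp 1, cp 2, end of segment 1
             c(40, 10), c(50, 30), c(60, 30))      # cp 1, cp 2, end of segment 2

# Cubic Bezier curve in Bernstein form; returns one row per value of t.
cubic_point <- function(p0, p1, p2, p3, t) {
  outer((1 - t)^3, p0) + outer(3 * (1 - t)^2 * t, p1) +
    outer(3 * (1 - t) * t^2, p2) + outer(t^3, p3)
}

# Discretize every segment at equal time steps.
discretize_eqtimed <- function(bez, n_per_segment = 20) {
  n_seg <- (ncol(bez) - 1) / 3
  t <- seq(0, 1, length.out = n_per_segment + 1)
  pts <- lapply(seq_len(n_seg), function(i) {
    j <- 3 * (i - 1) + 1                           # column of the segment's start
    seg <- cubic_point(bez[, j], bez[, j + 1], bez[, j + 2], bez[, j + 3], t)
    if (i > 1) seg[-1, , drop = FALSE] else seg    # drop the duplicated joint
  })
  do.call(rbind, pts)
}

# Resample a polyline at (approximately) equal distance steps.
resample_eqspaced <- function(pts, n_out = 15) {
  s <- c(0, cumsum(sqrt(rowSums(diff(pts)^2))))    # cumulative polyline length
  target <- seq(0, s[length(s)], length.out = n_out)
  cbind(approx(s, pts[, 1], xout = target, rule = 2)$y,
        approx(s, pts[, 2], xout = target, rule = 2)$y)
}

timed <- discretize_eqtimed(bez)
spaced <- resample_eqspaced(timed)
dim(timed)
dim(spaced)
```

Comparing the distances between consecutive rows of timed and spaced shows the difference between the two schemes: the equal-time points bunch up where the curve is traversed slowly, while the resampled points are nearly equidistant.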
strokedend is easier to examine and work with thanks to various convenience functions for
dendrograms in the packages stats and dendextend, including str
and plot.dendrogram. The function plot.kanjivec with option
type = "dend" is a wrapper for plot.dendrogram with reasonable presets
for various options.
The label-attributes of the nodes of strokedend are taken from the element (for inner nodes)
and type (for leaves) attributes of the .svg files. They consist of UTF-8 characters representing
kanji parts and a combination of UTF-8 characters for representing strokes, and may not render
well in all CJK fonts (see details of plot.kanjivec). If element and type are missing
in the .svg file, the label assigned is the second part of the id-attribute, e.g. g5 or s9.
The components at a given level can be plotted, see plot.kanjivec with
type = "kanji". Both components and veins serve mainly for the computation
of kanji distances.
Value
A list of objects of class kanjivec or, if only one kanji was specified and
simplify is TRUE, a single object of class kanjivec. If save is not FALSE,
the same result is saved and still returned, but invisibly.
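The documented interplay of save, overwrite and the invisible return value can be sketched in base R as follows. The helper compute_and_save is hypothetical and not part of kanjistat; it only mimics the described behaviour for an arbitrary computation, under the assumption that the existence check happens before any work is done.

```r
# Hypothetical helper mimicking the documented save/overwrite behaviour.
compute_and_save <- function(compute, save = FALSE, overwrite = FALSE) {
  if (isFALSE(save)) return(compute())             # plain, visible return
  path <- if (isTRUE(save)) "kvecsave.rds" else save
  # Fail early, before any computation is done.
  if (!overwrite && file.exists(path)) {
    stop("file '", path, "' already exists and overwrite = FALSE")
  }
  res <- compute()
  saveRDS(res, path)
  invisible(res)                                   # saved and returned invisibly
}

# Usage with a temporary file so that the working directory stays untouched.
path <- tempfile(fileext = ".rds")
res <- compute_and_save(function() 1:3, save = path)
identical(readRDS(path), res)
```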
See Also
Examples
if (interactive()) {
# Try to load the svg file for the kanji from GitHub.
res <- kanjivec("\u85e4", database=NULL)
str(res)
}
fivebetas # sample kanjivec data
str(fivebetas[[1]])