kanjivec {kanjistat} | R Documentation |
Create kanjivec objects from kanjivg data
Description
Create a (list of) kanjivec object(s). Each object represents a kanji as a tree of strokes, built from the .svg files of the KanjiVG database and enriched with further, derived information.
Usage
kanjivec(
kanji,
database = NULL,
flatten = "intelligent",
bezier_discr = c("svgparser", "eqtimed", "eqspaced"),
save = FALSE,
overwrite = FALSE,
simplify = TRUE
)
Arguments
kanji |
a (vector of) character string(s) of one or several kanji. |
database |
the path to a local copy of (a subset of) the KanjiVG database. It is expected
that the svg files reside at this exact location (not in a subdirectory). If |
flatten |
logical. Should nodes that are only-children be fused with their parents? Alternatively one of the strings "intelligent", "inner" or "leaves". Although the first is the default it is experimental and the precise meaning will change in the future; see details. |
bezier_discr |
character. How to discretize the Bézier curves describing the strokes. If "svgparser" (the only option available prior to kanjistat 0.12.0), code from the non-CRAN package svgparser is used for discretizing at equal time steps. The new choices "eqtimed" and "eqspaced" discretize into fewer points (and allow for more customization underneath). The former creates discretization points at equal time steps, the latter at equal distance steps (to a good approximation). |
save |
logical or character. If FALSE return the (list of) kanjivec object(s). Otherwise save the result as an rds file in the working directory (as kvecsave.rds) or under the file path provided. |
overwrite |
logical. If FALSE, raise an error (before any computations are done) if the designated file path already exists. Otherwise an existing file is overwritten. |
simplify |
logical. Should a single kanjivec object be returned (instead of a list of one) if only one kanji was specified? |
Details
A kanjivec object contains detailed information on the strokes of which an individual kanji is composed including their order, a segmentation into reasonable components ("radicals" in a more general sense of the word), classification of individual strokes, and both vector data and interpolated points to recreate the actual stroke in a Kyoukashou style font. For more information on the original data see http://kanjivg.tagaini.net/. That data is licenced under Creative Commons BY-SA 3.0 (see licence file of this package).
The original .svg files sometimes contain additional <g> elements that provide information about the current group of strokes rather than establishing a new subgroup of its own. This typically happens for information that establishes coherence with another part of the tree (by noting that the current subgroup is also part 2 of something else), but also for variant information. With the option flatten = TRUE the extra hierarchy level in the tree is avoided, while the original information in the KanjiVG file is kept. This is achieved by fusing only-children to their parents, giving the new node the name of the child and all its attributes, but prefixing p. to the attribute names of the parent (the parent's "names" attribute is discarded, but can be reconstructed from the parent's id). Removing several hierarchy levels in sequence can lead to attribute names with multiple p. in front. Fusing to parents is suppressed if the parent is the root of the hierarchy (typically for one-stroke kanji), as this could lead to confusing results.
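The fusing rule described above can be illustrated with a small base R sketch. This is not the kanjistat implementation: the helpers make_node and fuse_only_child are hypothetical, the ids g1, g2, g3 are toy values, and the chain is assumed to sit somewhere below the root of a larger tree.

```r
# Hypothetical helper: attach XML-like attributes to a list or matrix.
make_node <- function(x, ...) {
  attrs <- list(...)
  for (nm in names(attrs)) attr(x, nm) <- attrs[[nm]]
  x
}

# Hypothetical helper: fuse an only-child into its parent. The child keeps
# its own attributes; the parent's attributes are copied with a "p." prefix.
fuse_only_child <- function(parent) {
  stopifnot(is.list(parent), length(parent) == 1)
  child <- parent[[1]]
  p_attrs <- attributes(parent)
  p_attrs$names <- NULL  # the parent's "names" attribute is discarded
  for (nm in names(p_attrs)) {
    attr(child, paste0("p.", nm)) <- p_attrs[[nm]]
  }
  child
}

# Toy chain g1 > g2 > g3, where g3 has two stroke leaves.
leaf1 <- make_node(matrix(0, 1, 2), id = "s1")
leaf2 <- make_node(matrix(0, 1, 2), id = "s2")
g3 <- make_node(list(s1 = leaf1, s2 = leaf2), id = "g3", element = "X")
g2 <- make_node(list(g3 = g3), id = "g2", part = "2")
g1 <- make_node(list(g2 = g2), id = "g1")

step1 <- fuse_only_child(g1)     # g2 absorbs g1: gains p.id = "g1"
step2 <- fuse_only_child(step1)  # g3 absorbs g2: gains p.id, p.part, p.p.id
str(attributes(step2)[c("id", "p.id", "p.part", "p.p.id")])
```

Applying the fusion twice in sequence is what produces the doubly prefixed attribute p.p.id in this toy example.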
The options flatten = "inner" and flatten = "leaves" implement the above behavior only for the corresponding type of node (inner nodes or leaves). The option flatten = "intelligent" tries to find out in more sophisticated ways which flattening is desirable and which is not (it will flatten rather conservatively). Currently, nodes without an element attribute that have only one child are flattened away (one example where this is reasonable is in kanji kbase[187, ]), as are nodes with an element attribute and only one child if this child is also an inner node and has the same element and part attribute as the parent, but both have no number (this would be problematic for any component-building code in the particular case of kanji kbase[1111, ]).
A kanjivec object has components
char
the kanji (a single character)
hex
its Unicode codepoint (integer of class hexmode)
padhex
the Unicode codepoint padded with zeros to five digits (mode character)
family
the font on which the data is based. Currently only "schoolbook" (to be extended with "kaisho" at some point)
nstrokes
the number of strokes in the kanji
ncompos
a vector of the number of components at each depth of the tree
nveins
the number of veins in the component structure
strokedend
the decomposition tree of the kanji as an object of class dendrogram
components
the component structure by segmentation depth (components can overlap) in terms of KanjiVG elements and their depth-first tree coordinates
veins
the veins in the component structure. Each vein is represented as a two-column matrix that lists in its rows the indices of components (starting at the root, which in the component indexing is c(1,1))
stroketree
the decomposition tree of the kanji, a list containing the full information of the KanjiVG file (except some top level attributes)
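As a rough illustration of how hex and padhex relate to char, the following base R sketch derives both from a single kanji. It only uses base functions and is an assumption about the relationship between the components, not a copy of how kanjistat computes them.

```r
kanji <- "\u85e4"                 # the kanji used in the Examples section
cp <- utf8ToInt(kanji)            # Unicode codepoint as a plain integer
hex <- as.hexmode(cp)             # integer of class "hexmode"
padhex <- format(hex, width = 5)  # zero-padded to five digits, mode character

class(hex)
print(hex)
print(padhex)
```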
stroketree is a close representation of the KanjiVG svg file as a list object with some serious nesting of sublists. The XML attributes become attributes of the list and its elements. The user will usually not have to look at or manipulate stroketree directly, but strokedend and components are derived from it and other functions may process it further.
The main differences to the svg file are:
The actual strokes are not only given as d-attributes describing Bézier curves, but also as two-column matrices describing discretizations of these curves. These matrices are the actual contents of the innermost lists in stroketree, but are more conveniently accessed via the function get_strokes. Starting with version 0.13.0, there is also an additional attribute "beziermat", which describes the Bézier curves for the stroke in a 2 x (1+3n) matrix format. The first column is the start point, then each triplet of columns stands for control point 1, control point 2 and end point (= start point of the next Bézier curve, if any).
The positions of the stroke numbers (for plotting) are saved as an attribute strokenum_coords to the entire stroke tree rather than as a separate element.
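To make the 2 x (1+3n) layout and the two discretization ideas more concrete, here is a minimal base R sketch. The matrix values are invented, the helpers eval_cubic, discretize_eqtimed and discretize_eqspaced are hypothetical names, and the equal-distance variant uses a simple arc-length resampling that is only assumed to resemble what "eqspaced" does; it is not the package's code.

```r
# Invented stroke of n = 2 cubic Bezier curves in 2 x (1 + 3n) layout:
# column 1 is the start point, then (control 1, control 2, end) per curve.
beziermat <- matrix(c(10, 10,
                      20, 40,   40, 40,   50, 10,
                      60, -20,  80, -20,  90, 10), nrow = 2)

# Evaluate one cubic Bezier curve (P is 2 x 4) at parameter values u.
eval_cubic <- function(P, u) {
  B <- rbind((1 - u)^3, 3 * (1 - u)^2 * u, 3 * (1 - u) * u^2, u^3)
  t(P %*% B)  # length(u) x 2 matrix of points
}

# Equal time steps: the same parameter grid on every curve of the stroke.
discretize_eqtimed <- function(bm, n_per_curve = 10) {
  stopifnot(nrow(bm) == 2, (ncol(bm) - 1) %% 3 == 0)
  n_curves <- (ncol(bm) - 1) / 3
  u <- seq(0, 1, length.out = n_per_curve + 1)
  pieces <- lapply(seq_len(n_curves), function(i) {
    P <- bm[, (3 * i - 2):(3 * i + 1), drop = FALSE]
    pts <- eval_cubic(P, u)
    if (i > 1) pts[-1, , drop = FALSE] else pts  # drop the shared joint
  })
  do.call(rbind, pieces)
}

# Approximately equal distance steps: resample a dense polyline by arc length.
discretize_eqspaced <- function(bm, n_points = 20, fine = 200) {
  dense <- discretize_eqtimed(bm, n_per_curve = fine)
  s <- c(0, cumsum(sqrt(rowSums(diff(dense)^2))))  # cumulative length
  target <- seq(0, s[length(s)], length.out = n_points)
  cbind(approx(s, dense[, 1], xout = target, rule = 2)$y,
        approx(s, dense[, 2], xout = target, rule = 2)$y)
}

step_lengths <- function(p) sqrt(rowSums(diff(p)^2))
et <- discretize_eqtimed(beziermat)
es <- discretize_eqspaced(beziermat)
range(step_lengths(et))  # varies with the speed along the curve
range(step_lengths(es))  # nearly constant
```

In this toy stroke, the equal-time points bunch up where the curve is traversed slowly, while the resampled points are spread almost evenly along its length.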
strokedend is easier to examine and work with due to various convenience functions for dendrograms in the packages stats and dendextend, including str and plot.dendrogram. The function plot.kanjivec with option type = "dend" is a wrapper for plot.dendrogram with reasonable presets for various options.
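The kind of inspection this enables can be sketched on a toy dendrogram built with base R. The object below is a stand-in created from made-up numbers, not kanji data; with kanjistat loaded, the same calls should in principle apply to something like fivebetas[[1]]$strokedend.

```r
# Toy dendrogram (stand-in for a strokedend object).
x <- c(a = 1, b = 2, c = 6, d = 7)
dend <- as.dendrogram(hclust(dist(x)))

str(dend)     # compact text view of the tree structure
nobs(dend)    # number of leaves
labels(dend)  # leaf labels in plotting order
attr(dend[[1]], "members")  # size of the first branch below the top node

# Plot only in interactive sessions to avoid creating graphics files.
if (interactive()) plot(dend)
```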
The label-attributes of the nodes of strokedend are taken from the element (for inner nodes) and type (for leaves) attributes of the .svg files. They consist of UTF-8 characters representing kanji parts and a combination of UTF-8 characters for representing strokes and may not render well in all CJK fonts (see details of plot.kanjivec). If element and type are missing in the .svg file, the label assigned is the second part of the id-attribute, e.g. g5 or s9.
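The fallback to the second part of the id-attribute can be sketched in base R as follows. The id format "kvg:<padhex>-<part>" follows the usual KanjiVG convention but is an assumption here, and fallback_label is a hypothetical helper, not a kanjistat function.

```r
# Hypothetical helper: keep everything after the first hyphen of the id.
fallback_label <- function(id) sub("^[^-]*-", "", id)

ids <- c("kvg:085e4-g5", "kvg:085e4-s9")  # assumed KanjiVG-style ids
fallback_label(ids)
```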
The components at a given level can be plotted, see plot.kanjivec with type = "kanji". Both components and veins serve mainly for the computation of kanji distances.
Value
A list of objects of class kanjivec
or, if only one kanji was specified and
simplify
is TRUE
, a single objects of class kanjivec
. If save = TRUE
,
the same is (saved and) still returned invisibly.
See Also
Examples
if (interactive()) {
# Try to load the svg file for the kanji from GitHub.
res <- kanjivec("\u85e4", database=NULL)
str(res)
}
fivebetas # sample kanjivec data
str(fivebetas[[1]])