kanjidist {kanjistat} | R Documentation |
Compute distance between two kanjivec objects based on hierarchical optimal transport
Description
The kanji distance is based on matching hierarchical component structures in a nesting-free way across all levels. The cost for matching individual components is a cost for registering the components (i.e. aligning their position, scale and aspect ratio) plus the (relative unbalanced) Wasserstein distance between the registered components.
Usage
kanjidist(
  k1,
  k2,
  compo_seg_depth1 = 3,
  compo_seg_depth2 = 3,
  p = 1,
  C = 0.2,
  approx = c("grid", "pc", "pcweighted"),
  type = c("rtt", "unbalanced", "balanced"),
  size = 48,
  lwd = 2.5,
  density = 30,
  verbose = FALSE,
  minor_warnings = TRUE
)
Arguments
k1 , k2 |
two objects of type |
compo_seg_depth1 , compo_seg_depth2 |
two integers |
p |
the order of the Wasserstein distance used for matching components. All distances and
the penalty (if any) are taken
to the |
C |
the penalty for extra mass if |
approx |
what kind of approximation is used for matching components. If this is |
type |
the type of Wasserstein distance used for matching components based on the grid or
point cloud approximation chosen. |
size |
side length of the bitmaps used for matching components (if |
lwd |
linewidth for drawing the components in these bitmaps (if |
density |
approximate number of discretization points per unit line length (if |
verbose |
logical. Whether to print detailed information on the cost for all pairs of components and the final matching. |
minor_warnings |
logical. Whether minor warnings should be given. If |
Details
For the precise definition and details see the reference below. Parameter C corresponds to b/2^{1/p} in the paper.
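The correspondence between C and the paper's b can be written out as a short calculation. The sketch below is only an illustration: b_to_C is a hypothetical helper name and is not part of kanjistat; it simply evaluates b/2^{1/p} as stated above.

```r
# Hypothetical helper (not part of kanjistat): convert the paper's
# parameter b into the C argument of kanjidist via C = b / 2^(1/p).
b_to_C <- function(b, p = 1) {
  stopifnot(is.numeric(b), is.numeric(p), all(p > 0))
  b / 2^(1 / p)
}

# For p = 1 the formula reduces to C = b / 2,
# so b = 0.4 gives the default C = 0.2.
b_to_C(0.4, p = 1)

# For p = 2 the same b is divided by sqrt(2) instead.
b_to_C(0.4, p = 2)
```

The result can then be passed on, for example as kanjidist(k1, k2, p = 2, C = b_to_C(0.4, p = 2)).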
Value
The kanji distance, a non-negative number.
Warning
The interface and details of this function will change in the future. Currently only a minimal
set of parameters can be passed. The other parameters are fixed exactly as in the
"prototype distance" (4.1) of the reference below, for better or worse.
Exact matches of components tend to be favored rather strongly (if the KanjiVG elements
agree, this can overrule the unbalanced Wasserstein distance), and the penalties for
translation, scaling and distortion of components are somewhat mild.
The computation time is rather high (up to several seconds per kanji pair, depending on the
settings and the kanji). This can be alleviated somewhat by keeping the compo_seg_depth
parameters at 3 or lower and setting size = 32 (which goes well with lwd = 1.8).
Future versions will use a much faster line-based optimal transport algorithm and include
further speed-ups.
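To see how much the cheaper settings help on a given machine, the two calls can be timed side by side. This is a minimal sketch: elapsed is a hypothetical helper built on base R's system.time, and the kanjidist calls only run if kanjistat and ROI.plugin.glpk are installed. Actual timings depend on the kanji and the hardware.

```r
# Hypothetical helper: elapsed wall-clock seconds for one call of f().
elapsed <- function(f) unname(system.time(f())["elapsed"])

if (requireNamespace("kanjistat", quietly = TRUE) &&
    requireNamespace("ROI.plugin.glpk", quietly = TRUE)) {
  library(kanjistat)
  k1 <- fivebetas[[4]]
  k2 <- fivebetas[[5]]

  # default settings
  t_default <- elapsed(function() kanjidist(k1, k2))

  # reduced depth and bitmap size, as suggested in the text above
  t_fast <- elapsed(function() {
    kanjidist(k1, k2, compo_seg_depth1 = 2, compo_seg_depth2 = 2,
              size = 32, lwd = 1.8)
  })

  print(c(default = t_default, fast = t_fast))
}
```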
References
Dominic Schuhmacher (2023).
Distance maps between Japanese kanji characters based on hierarchical optimal transport.
ArXiv, doi:10.48550/arXiv.2304.02493
See Also
Examples
if (requireNamespace("ROI.plugin.glpk")) {
  kanjidist(fivebetas[[4]], fivebetas[[5]])
  kanjidist(fivebetas[[4]], fivebetas[[5]], verbose=TRUE)
  # faster and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], compo_seg_depth1=2, compo_seg_depth2=2,
            size=32, lwd=1.8, verbose=TRUE)
  # slower and similar:
  kanjidist(fivebetas[[4]], fivebetas[[5]], size=64, lwd=3.2, verbose=TRUE)
}
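When more than two kanji are involved, the same call can be applied to every unordered pair. The following is a sketch under stated assumptions: pairwise_upper is a hypothetical helper (not part of kanjistat) that fills only the upper triangle of a matrix, so it assumes nothing about symmetry or about the distance of a kanji to itself. The subset fivebetas[3:5] is chosen only to keep the run time short.

```r
# Hypothetical helper (not part of kanjistat): apply dist_fun to every
# unordered pair (i < j) of a list and store the results in the upper
# triangle of an n x n matrix; all other entries stay NA.
pairwise_upper <- function(objs, dist_fun, ...) {
  n <- length(objs)
  d <- matrix(NA_real_, n, n)
  if (n < 2) return(d)
  for (i in seq_len(n - 1)) {
    for (j in seq(i + 1, n)) {
      d[i, j] <- dist_fun(objs[[i]], objs[[j]], ...)
    }
  }
  d
}

if (requireNamespace("kanjistat", quietly = TRUE) &&
    requireNamespace("ROI.plugin.glpk", quietly = TRUE)) {
  library(kanjistat)
  # cheaper settings from the Warning section to limit computation time
  print(pairwise_upper(fivebetas[3:5], kanjidist,
                       compo_seg_depth1 = 2, compo_seg_depth2 = 2,
                       size = 32, lwd = 1.8))
}
```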