R: Compute the stroke edit distances between two sets of kanji

sedist {kanjistat}

R Documentation

Compute the stroke edit distances between two sets of kanji

Description

Variants of the stroke edit distance proposed by Yencken (2010). Each kanji is encoded as a sequence of stroke types according to its stroke order, using the type attribute from the kanjiVG data. Then the edit distance (a.k.a. Levenshtein distance) between the two sequences is computed and divided by the larger of the two stroke counts.
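
The normalization described above can be sketched in base R as follows. This is only an illustration, not the package's implementation: the helper names `seq_edit_distance()` and `normalized_sed()` are hypothetical, and the two stroke sequences are made-up examples rather than values read from kanjiVG.

```r
# Levenshtein distance between two token sequences (character vectors),
# computed with the standard dynamic-programming table.
seq_edit_distance <- function(a, b) {
  n <- length(a)
  m <- length(b)
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1L] <- 0:n   # cost of deleting the first i tokens of a
  d[1L, ] <- 0:m   # cost of inserting the first j tokens of b
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (a[i] == b[j]) 0L else 1L
      d[i + 1L, j + 1L] <- min(d[i, j + 1L] + 1L,  # deletion
                               d[i + 1L, j] + 1L,  # insertion
                               d[i, j] + cost)     # match or substitution
    }
  }
  d[n + 1L, m + 1L]
}

# Divide by the larger of the two sequence lengths (stroke counts).
normalized_sed <- function(a, b) {
  seq_edit_distance(a, b) / max(length(a), length(b))
}

# Illustrative stroke-type sequences (assumed, not taken from kanjiVG)
a <- c("㇐", "㇑", "㇐")
b <- c("㇐", "㇑", "㇒", "㇏")
normalized_sed(a, b)  # 2 edit operations / 4 strokes = 0.5
```

Working on vectors of tokens rather than on single strings keeps multi-character stroke types (a stroke character plus a letter or variant) together as one unit.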

Usage

sedist(k1, k2, type = c("full", "before_slash", "first"))

Arguments

`k1`, `k2`	atomic vectors or lists of kanji in any format that can be treated by `convert_kanji()`
`type`	the type of stroke edit distance to compute. See details.

Details

The kanjiVG type attribute is a single string composed of a CJK strokes Unicode character, an optional Latin letter providing further information, and possibly a variant (another CJK strokes character with optional letter) separated by "/". If type is "full", a match is only counted if the two strings are exactly the same; "before_slash" ignores any slash and what comes after it; "first" only considers the first character of each string (i.e. the first CJK strokes character) when counting matches.
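
As a rough illustration of the three variants, the following sketch reduces type strings before they are compared. The helper `reduce_stroke_type()` is hypothetical and not part of kanjistat, and the example strings are invented to follow the format described above.

```r
# Reduce kanjiVG-style type strings according to the chosen variant.
reduce_stroke_type <- function(x, type = c("full", "before_slash", "first")) {
  type <- match.arg(type)
  switch(type,
    full = x,                           # compare the whole string
    before_slash = sub("/.*$", "", x),  # drop the slash and the variant
    first = substr(x, 1L, 1L)           # keep only the first character
  )
}

# Invented examples: stroke character, optional letter, optional variant
types <- c("㇐", "㇑a", "㇔/㇏", "㇕b/㇆")
reduce_stroke_type(types, "full")
reduce_stroke_type(types, "before_slash")
reduce_stroke_type(types, "first")
```

Two strokes then count as a match when their reduced strings are identical, so "first" is the most lenient variant and "full" the strictest.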

The stroke edit distance used by Yencken (2010) is obtained by setting type = "full" (the default), except that the underlying kanjiVG data has changed significantly since then. Comparing with the values in dstrokedit, we get an agreement of 96.3 percent; the remaining distances differ by a small amount (usually 1-2 edit operations).

Value

A length(k1) x length(k2) matrix of stroke edit distances.
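
To illustrate the shape of the result, here is a small sketch using `utils::adist()` from base R, which also returns a length(x) x length(y) matrix of Levenshtein distances. It operates on ordinary strings, so each ASCII letter stands in for one stroke type. This is an assumption made purely for illustration and is not how sedist obtains its stroke sequences.

```r
# ASCII stand-ins: each letter plays the role of one stroke type
x <- c(k1a = "abc", k1b = "abd")             # two "kanji" in the first set
y <- c(k2a = "abc", k2b = "xbc", k2c = "a")  # three in the second set

raw <- utils::adist(x, y)                    # 2 x 3 matrix of raw edit distances
# Normalize each entry by the larger of the two sequence lengths
norm <- raw / outer(nchar(x), nchar(y), pmax)

dim(norm)  # 2 3
norm
```

Rows correspond to the elements of the first argument and columns to those of the second, matching the layout stated above.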

Warning

Requires the kanjistat.data package.

References

Yencken, Lars (2010). Orthographic support for passing the reading hurdle in Japanese.
PhD Thesis, University of Melbourne, Australia

Examples

ind1 <- 384
k1 <- convert_kanji(ind1, "character")
# dstrokedit contains only the "closest" kanji, so keep the nonzero entries
ind2 <- which(dstrokedit[ind1, ] > 0)
k2 <- convert_kanji(ind2, "character")
row_a <- dstrokedit[ind1, ind2]  # reference values stored in dstrokedit
if (requireNamespace("kanjistat.data", quietly = TRUE)) {
  row_b <- sedist(k1, k2)        # distances computed by sedist (type = "full")
  mat <- rbind(row_a, row_b)     # stack both rows for a side-by-side comparison
  rownames(mat) <- c(k1, k1)
  colnames(mat) <- k2
  mat
}

[Package kanjistat version 0.14.1 Index]