R: Determine the Width of Code Points

stri_width {stringi}

R Documentation

Determine the Width of Code Points

Description

Approximates the number of text columns the 'cat()' function might use to print a string using a mono-spaced font.

Usage

stri_width(str)

Arguments

str

character vector or an object coercible to one

Details

The Unicode standard does not formalize the notion of a character width. The approach taken here is roughly based on http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c, https://github.com/nodejs/node/blob/master/src/node_i18n.cc, and UAX #11. The following code points are of width 0:

code points with general category Me, Mn, or Cf (see stringi-search-charclass),
C0 and C1 control codes (general category Cc), for compatibility with the nchar function,
Hangul Jamo medial vowels and final consonants (code points with enumerable property UCHAR_HANGUL_SYLLABLE_TYPE equal to U_HST_VOWEL_JAMO or U_HST_TRAILING_JAMO; note that applying the NFC normalization with stri_trans_nfc is encouraged),
ZERO WIDTH SPACE (U+200B).
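
The zero-width rules listed above can be checked with a short sketch. The sample strings are illustrative choices, not taken from the package documentation, and the expected values in the comments are what those rules imply; actual results may differ between stringi and ICU versions.

```r
library(stringi)

# 'e' followed by COMBINING ACUTE ACCENT (U+0301, general category Mn):
# the combining mark should add no width, so 1 is expected
stri_width("e\u0301")

# ZERO WIDTH SPACE (U+200B) between two ASCII letters: 2 is expected
stri_width("a\u200bb")

# TAB (U+0009) is a C0 control code (general category Cc): 0 is expected
stri_width("\t")
```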

Characters with the UCHAR_EAST_ASIAN_WIDTH enumerable property equal to U_EA_FULLWIDTH or U_EA_WIDE are of width 2.

Most emojis and characters with general category So (other symbols) are of width 2.
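
As a rough illustration of the width-2 rules, the sketch below uses a CJK ideograph (East Asian Width: Wide) and a fullwidth Latin letter (East Asian Width: Fullwidth). These code points are illustrative choices. The emoji call is left without an expected value, since the handling of emojis may vary across versions.

```r
library(stringi)

# CJK ideograph U+4E2D (Wide): 2 is expected
stri_width("\u4e2d")

# FULLWIDTH LATIN CAPITAL LETTER A (U+FF21, Fullwidth): 2 is expected
stri_width("\uff21")

# Two ASCII letters plus one wide ideograph: 1 + 1 + 2 = 4 is expected
stri_width("ab\u4e2d")

# An emoji (U+1F606); the result depends on the emoji rule above
stri_width("\U0001F606")
```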

SOFT HYPHEN (U+00AD), for compatibility with nchar, and all other characters have width 1.
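
SOFT HYPHEN has general category Cf, so it is an exception to the zero-width rule for format characters. A minimal sketch of that exception follows; the sample word is a hypothetical input, and the expected values in the comments follow from the rules stated here.

```r
library(stringi)

# SOFT HYPHEN alone: width 1 is expected, despite its Cf category
stri_width("\u00ad")

# Four ASCII letters with a soft hyphen in the middle:
# each of the five code points should contribute 1, giving 5
stri_width("co\u00adop")
```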

Value

Returns an integer vector of the same length as str.
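
The shape of the result can be sketched as follows. The input vector is a made-up example, and the sketch assumes that a missing value in str yields a missing value in the output, as is usual for stringi functions.

```r
library(stringi)

# Hypothetical input: ASCII text, an empty string, a missing value,
# and two wide CJK ideographs
x <- c("abc", "", NA, "\u4e2d\u6587")
w <- stri_width(x)

# One integer per input string; widths of the code points
# in each string are summed
is.integer(w)
length(w) == length(x)
w
```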

Author(s)

Marek Gagolewski and other contributors

References

East Asian Width – Unicode Standard Annex #11, https://www.unicode.org/reports/tr11/

Examples

stri_width(LETTERS[1:5]) # ASCII letters
stri_width(stri_trans_nfkd('\u0105')) # 'a' followed by a combining ogonek (general category Mn)
stri_width(stri_trans_nfkd('\U0001F606')) # an emoji
stri_width( # Full-width equivalents of ASCII characters:
   stri_enc_fromutf32(as.list(c(0x3000, 0xFF01:0xFF5E)))
)
stri_width(stri_trans_nfkd('\ubc1f')) # includes Hangul Jamo medial vowels and final consonants

[Package stringi version 1.8.4 Index]