u_char_basics {Unicode} | R Documentation |
Unicode Character Objects
Description
Data structures and basic methods for Unicode character data.
Usage
as.u_char(x)
as.u_char_range(x)
as.u_char_seq(x, sep = NA_character_)
Arguments
x | R objects coercible to the respective Unicode character data types, see Details. |
sep | a character string separating the hex codes within each sequence string, see Details. |
Details
Package Unicode provides three basic classes for representing
Unicode characters: u_char for vectors of Unicode characters,
u_char_range for vectors of Unicode character ranges, and
u_char_seq for vectors of Unicode character sequences. Objects
from these classes are created via the respective coercion functions.
as.u_char coerces integers or hex strings (with or without a leading
‘0x’ or the ‘U+’ prefix typically used for Unicode characters) to the
corresponding code points. It can also handle Unicode character
ranges, flattening them out into the corresponding vector of Unicode
characters. To “coerce” a UTF-8 encoded R character string to the
corresponding Unicode character object, first obtain its integer code
points via utf8ToInt and then coerce the result.
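The accepted input forms can be sketched as follows. This is a minimal illustration, not the package's implementation: the base R part only shows which code points the hex spellings denote, the variable names (hex, code_points, s, s_points) are made up for the example, and the package calls are guarded so that the block also runs where Unicode is not installed.

```r
# Base R view of the code point behind the three accepted hex spellings.
hex <- c("U+00E4", "0x00E4", "00E4")
code_points <- strtoi(sub("^(U\\+|0x)", "", hex), base = 16L)

# Integer code points of a UTF-8 encoded string, via utf8ToInt().
s <- "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
s_points <- utf8ToInt(s)

if (requireNamespace("Unicode", quietly = TRUE)) {
  # Hex strings and integers are coerced directly.
  print(Unicode::as.u_char(hex))
  print(Unicode::as.u_char(228L))
  # A character string is first turned into its code points.
  print(Unicode::as.u_char(s_points))
}
```

All calls should describe the same character, since every input denotes code point 228 (hex 00E4).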
as.u_char_range coerces character strings giving either a single
Unicode character (e.g., ‘01CC’) or a Unicode range expression
consisting of the hex codes of two Unicode characters joined by ‘..’
(currently hard-wired). It can also handle u_char objects, coercing
them to ranges of single code points.
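What a range expression stands for can be sketched in base R, as an illustration only and not as the package's own code. The names range_expr, bounds and flattened are hypothetical; the package calls mirror the coercions described above and are guarded in case Unicode is not installed.

```r
# A range expression: two hex codes joined by "..".
range_expr <- "00AA..00AC"

# Base R: split on the literal ".." and read both bounds as hex.
bounds <- strtoi(strsplit(range_expr, "..", fixed = TRUE)[[1L]], base = 16L)

# Flattening the range gives every code point from lower to upper bound.
flattened <- seq.int(bounds[1L], bounds[2L])

if (requireNamespace("Unicode", quietly = TRUE)) {
  r <- Unicode::as.u_char_range(range_expr)
  print(r)
  # Flatten the range into the individual Unicode characters.
  print(Unicode::as.u_char(r))
  # u_char objects become ranges of single code points.
  print(Unicode::as.u_char_range(Unicode::as.u_char(c(170L, 460L))))
}
```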
as.u_char_seq coerces character strings containing the hex codes of
Unicode characters separated by a non-empty sep. The default (NA)
corresponds to using ‘,’ if the strings are enclosed in angle
brackets, and ‘ ’ otherwise. If sep is empty or has length zero, the
character strings are used as is, re-encoded in UTF-8 if necessary,
and mapped to the corresponding Unicode character sequences using
utf8ToInt. as.u_char_seq can also handle Unicode character ranges
(giving the corresponding flattened out Unicode character sequences),
or lists of objects coercible to Unicode characters via as.u_char.
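The role of sep can be sketched as follows, assuming the input spellings described above. The base R lines only mimic the splitting step and are not taken from the package; hex_seq, seq_points and raw_points are names chosen for the example, and the package calls are guarded in case Unicode is not installed.

```r
# Hex codes separated by a space (the default when no angle brackets are used).
hex_seq <- "0041 030A"

# Base R: split on the separator and read each piece as hex.
seq_points <- strtoi(strsplit(hex_seq, " ", fixed = TRUE)[[1L]], base = 16L)

# With an empty sep the string itself is mapped via utf8ToInt().
raw_points <- utf8ToInt("Austria")

if (requireNamespace("Unicode", quietly = TRUE)) {
  # Default sep: " " for plain strings, "," for strings in angle brackets.
  print(Unicode::as.u_char_seq(hex_seq))
  print(Unicode::as.u_char_seq("<0041,030A>"))
  # Empty sep: the characters of the string form the sequence.
  print(Unicode::as.u_char_seq("Austria", ""))
}
```

Here hex_seq denotes LATIN CAPITAL LETTER A followed by COMBINING RING ABOVE, so intToUtf8(seq_points) reassembles the two-character string.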
All classes currently have as.character, as.data.frame, c, format,
print, rep, unique and [ subscript methods. More methods will be
added eventually.
Value
For as.u_char, a u_char object giving a vector of Unicode characters.
For as.u_char_range, a u_char_range object giving a vector of Unicode
character ranges.
For as.u_char_seq, a u_char_seq object giving a vector of Unicode
character sequences.
References
Unicode Character Database (https://www.unicode.org/ucd/),
https://en.wikipedia.org/wiki/Unicode
Examples
x <- as.u_char_range(c("00AA..00AC", "01CC"))
x
## Corresponding Unicode character sequence object:
as.u_char_seq(x)
## Corresponding Unicode character object with all code points:
as.u_char(x)
## Inspect all Unicode characters in the range:
u_char_inspect(x)
## Turning R character strings into the respective Unicode character
## sequences:
as.u_char_seq(c("Austria", "Trantor"), "")
## which can then be subscripted "as usual", e.g.:
x <- as.u_char_seq(c("Austria", "Trantor"), "")[[1L]][c(3L, 5L)]
x
## To reassemble the character strings:
intToUtf8(x)