encoding {tau}    R Documentation

Adapt the (Declared) Encoding of a Character Vector

Description

Functions for testing and adapting the (declared) encoding of the components of a vector of mode character.

Usage

is.utf8(x)
is.ascii(x)
is.locale(x)

translate(x, recursive = FALSE, internal = FALSE)
fixEncoding(x, latin1 = FALSE)

Arguments

x

a vector (of character).

recursive

logical; if TRUE, the components of a list are processed.

internal

logical; if TRUE, internal translation is used (see Details).

latin1

logical; if TRUE, assume "latin1" if the declared encoding is "unknown".

Details

is.utf8 tests if the components of a vector of character are true UTF-8 strings, i.e., contain one or more valid UTF-8 multi-byte sequence(s).

is.ascii tests if the components of a vector of character contain only ASCII characters.
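The definition of a "true" UTF-8 string can be sketched in base R. This is only an illustration, not the code tau uses: the helper names is_ascii_sketch and is_true_utf8_sketch are hypothetical, and the sketch assumes that a pure ASCII string does not count as a true UTF-8 string because it contains no multi-byte sequence.

```r
## Hypothetical helpers for illustration only; they are not part of tau.

## TRUE if every byte of the string is below 0x80
is_ascii_sketch <- function(x) {
  vapply(x, function(s) all(as.integer(charToRaw(s)) < 128L),
         logical(1), USE.NAMES = FALSE)
}

## TRUE if the bytes are valid UTF-8 and at least one byte is non-ASCII,
## i.e. the string contains at least one multi-byte sequence
is_true_utf8_sketch <- function(x) {
  validUTF8(x) & !is_ascii_sketch(x)
}

x <- c("aa", "a\xc3\xa4", "a\xe4")
ascii_result <- is_ascii_sketch(x)
utf8_result  <- is_true_utf8_sketch(x)
```

Here "a\xc3\xa4" holds the two-byte UTF-8 sequence for a-umlaut, whereas "a\xe4" holds the single latin1 byte, which is not valid UTF-8.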

is.locale tests if the components of a vector of character are in the encoding of the current locale.
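A rough base R approximation of such a locale test is sketched below. It is an assumption about one possible approach, not tau's implementation: in_locale_sketch is a hypothetical name, the UTF-8 branch relies on validUTF8, and the fallback assumes that iconv returns NA for input that is invalid in the locale encoding, which may vary by platform.

```r
## Hypothetical helper for illustration only; it is not part of tau.
in_locale_sketch <- function(x) {
  if (isTRUE(l10n_info()[["UTF-8"]])) {
    ## UTF-8 locale: the bytes must form valid UTF-8
    validUTF8(x)
  } else {
    ## other locales: "" denotes the current locale encoding in iconv;
    ## input that cannot be converted is assumed to yield NA
    !is.na(iconv(x, from = "", to = ""))
  }
}

locale_result <- in_locale_sketch(c("aa", "a\xe4"))
```

Pure ASCII strings should pass in any common locale, whereas the result for "a\xe4" depends on the locale R runs in.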

translate encodes the components of a vector of character in the encoding of the current locale. This includes the names attribute of vectors of arbitrary mode. If recursive = TRUE, the components of a list are processed. If internal = TRUE, multi-byte sequences that are invalid in the encoding of the current locale are changed to literal hex numbers (see FIXME).
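For comparison, base R offers related operations. The sketch below uses enc2native and iconv rather than translate itself, so it only approximates the behaviour described above. In particular, the "<xx>" form produced by iconv with sub = "byte" is base R's convention and is not necessarily the hex format that translate emits.

```r
## Base R analogues for illustration; translate() itself is not called here.
x <- "a\xe4"
Encoding(x) <- "latin1"

## re-encode in the native (locale) encoding, based on the declared encoding
y <- enc2native(x)

## the names attribute of a vector of arbitrary mode can be handled the same way
v <- c(1, 2)
names(v) <- c("aa", x)
names(v) <- enc2native(names(v))

## bytes that are invalid in the source encoding are replaced by their
## hex code in angle brackets when sub = "byte"
z <- iconv("a\xe4", from = "UTF-8", to = "UTF-8", sub = "byte")
```

The value of y depends on the current locale, while z should be the ASCII string "a<e4>".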

fixEncoding sets the declared encoding of the components of a vector of character to their correct or preferred values. If latin1 = TRUE, strings that are not valid UTF-8 strings are declared to be in "latin1". On the other hand, strings that are true UTF-8 strings are declared to be in "UTF-8" encoding.
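The rule described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the implementation in tau: fix_encoding_sketch is a hypothetical name, and the sketch looks only at the bytes of each string and ignores any previously declared encoding.

```r
## Hypothetical helper for illustration only; it is not part of tau.
fix_encoding_sketch <- function(x, latin1 = FALSE) {
  ## strings that contain at least one non-ASCII byte
  nonascii <- vapply(x, function(s) any(as.integer(charToRaw(s)) > 127L),
                     logical(1), USE.NAMES = FALSE)
  ## "true" UTF-8: valid UTF-8 with at least one multi-byte sequence
  utf8 <- validUTF8(x) & nonascii
  Encoding(x)[utf8] <- "UTF-8"
  ## optionally declare the remaining non-ASCII strings as latin1
  if (latin1) Encoding(x)[nonascii & !utf8] <- "latin1"
  x
}

x <- c("aa", "a\xc3\xa4", "a\xe4")
Encoding(x) <- "unknown"
enc_default <- Encoding(fix_encoding_sketch(x))
enc_latin1  <- Encoding(fix_encoding_sketch(x, latin1 = TRUE))
```

With the default latin1 = FALSE, only the valid UTF-8 string receives a declaration. With latin1 = TRUE, the string "a\xe4" is additionally declared as "latin1". Pure ASCII strings always report "unknown" in R.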

Value

The same type of object as x with the (declared) encoding possibly changed.

Note

Currently, translate uses iconv and is therefore not guaranteed to work on all platforms.

Author(s)

Christian Buchta

References

FIXME PCRE, RFC 3629

See Also

Encoding and iconv.

Examples

## Note that we assume R runs in a UTF-8 locale
text <- c("aa", "a\xe4")
Encoding(text) <- c("unknown", "latin1")
is.utf8(text)
is.ascii(text)
is.locale(text)
## implicit translation
text
##
t1 <- iconv(text, from = "latin1", to = "UTF-8")
Encoding(t1)
## oops: with lower-case "utf-8" the result may not be marked as UTF-8
t2 <- iconv(text, from = "latin1", to = "utf-8")
Encoding(t2)
t2
is.locale(t2)
##
t2 <- fixEncoding(t2)
Encoding(t2)
## explicit translation
t3 <- translate(text)
Encoding(t3)

[Package tau version 0.0-25 Index]