decodeVN {vietnameseConverter} | R Documentation
Convert characters from legacy Vietnamese encodings to UTF-8 encoding
Description
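Converts text stored in legacy Vietnamese encodings (TCVN3, VISCII, VPS) to readable Unicode characters in UTF-8 encoding. Works on character vectors, data frames, and sf objects.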
Usage
decodeVN(
x,
from = c("TCVN3", "VISCII", "VPS", "Unicode"),
to = c("Unicode", "TCVN3", "VISCII", "VPS"),
diacritics = TRUE
)
Arguments
x | data.frame, sf object, or character vector
from | Text encoding of input x (default: "TCVN3")
to | Text encoding of output (default: "Unicode")
diacritics | logical. Preserve diacritics (TRUE) or not (FALSE)?
Details
Many characters in legacy Vietnamese encodings (e.g. TCVN3, VPS, VISCII) are not read correctly in R, particularly those with diacritics (accents). R does not appear to support these encodings, at least in many locales. When R reads such text as if it were UTF-8, the wrong characters are printed and the text appears garbled (mojibake; see the vignette and the examples below).
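The following base R sketch shows one plausible way such garbled text arises. It assumes that the byte 0xB6 encodes the letter U+1EA3 in TCVN3 and that the file was read with a single-byte Latin-1 interpretation; both points are assumptions made for illustration and are not taken from the package internals.

```r
# Raw bytes of a word as it might be stored in a TCVN3-encoded file.
# Assumption: 0xB6 is the TCVN3 byte for U+1EA3 (a with hook above).
tcvn3_bytes <- as.raw(c(0x51, 0x75, 0xB6, 0x6E, 0x67))

# Interpreting each byte as a Latin-1 character turns 0xB6 into the
# pilcrow sign (U+00B6) instead of the intended Vietnamese letter.
garbled <- iconv(rawToChar(tcvn3_bytes), from = "latin1", to = "UTF-8")
garbled

# Inspect the code points to see which characters R actually holds.
sprintf("U+%04X", utf8ToInt(garbled))
```

Looking at code points in this way can help confirm that a column contains misread legacy text before any conversion is attempted.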
This function converts character vectors from various Vietnamese legacy encodings to readable Unicode characters in UTF-8 encoding. By default it attempts the conversion from TCVN3 to Unicode while preserving the diacritics, but it also supports other Vietnamese encodings (VPS, VISCII) via the argument from. Currently VNI and VNU are not supported.
It works on data frames, spatial objects (from the sf package), and character vectors.
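As a rough illustration of how a converter for character vectors can be extended to data frames, the sketch below applies a function to character columns only and leaves other columns untouched. The names convert_columns and convert_chr are hypothetical, and toupper merely stands in for a real decoder; this is not the package's own implementation.

```r
# Hypothetical stand-in for a decoder: any function that maps a
# character vector to a character vector of the same length.
convert_chr <- function(x) toupper(x)

# Apply f to the character columns of a data frame and keep
# all other columns unchanged.
convert_columns <- function(df, f) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], f)
  df
}

df <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"),
                 stringsAsFactors = FALSE)
out <- convert_columns(df, convert_chr)
out
```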
With diacritics = TRUE, the function returns characters with their diacritics. With diacritics = FALSE, the output consists of ASCII letters without diacritics. Upper and lower case are preserved in both cases.
The internal search and replace is performed by the gsubfn function from the gsubfn package, which carries out simple character-by-character replacements to fix the text.
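The idea of a table-driven replacement can be sketched in base R as follows. This is a simplified illustration that uses gsub rather than gsubfn. The four-entry table tcvn3_excerpt and the function decode_sketch are hypothetical, and the mappings are assumed from the example strings on this page rather than copied from the package's conversion table.

```r
# Hypothetical four-entry excerpt of a conversion table; a real
# table covers every accented letter of the encoding.
tcvn3_excerpt <- data.frame(
  garbled = c("\u00B6", "\u00DE", "\u00A7", "\u00AB"),
  unicode = c("\u1EA3", "\u1ECB", "\u0110", "\u00F4"),
  ascii   = c("a", "i", "D", "o"),
  stringsAsFactors = FALSE
)

# Replace each garbled character with its Unicode letter or, with
# diacritics = FALSE, with the plain ASCII base letter.
decode_sketch <- function(x, diacritics = TRUE) {
  target <- if (diacritics) tcvn3_excerpt$unicode else tcvn3_excerpt$ascii
  for (i in seq_len(nrow(tcvn3_excerpt))) {
    x <- gsub(tcvn3_excerpt$garbled[i], target[i], x, fixed = TRUE)
  }
  x
}

decode_sketch("An \u00A7\u00ABn")
decode_sketch("Qu\u00B6ng Tr\u00DE", diacritics = FALSE)
```

One caveat of this loop is that the replacements run one after another, so with a full table a character produced by an early replacement could be altered again by a later one. A single-pass replacement avoids that problem.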
Currently the function converts from the Vietnamese encodings to Unicode, not vice versa. Please contact the maintainer if the conversion from Unicode to Vietnamese encodings would be relevant for you.
The character conversion table was adapted from http://vietunicode.sourceforge.net/charset/.
Value
character string or data frame (depending on x)
Warning
When printing a data frame with Unicode characters using the standard print method, the R console will show the Unicode escape characters (e.g. "<U+1EA3>") instead of the actual Unicode characters. This is a limitation of the R console. The data are correct and will show correctly when using e.g. View() or when printing columns as vectors.
Examples
# First we produce the wrongly formatted character string
# using Unicode symbols is only necessary to create a portable example in the R package
# you don't need to use Unicode characters like this in your data
string <- c("Qu\u00B6ng Tr\u00DE", "An \u00A7\u00ABn", "Th\u00F5a Thi\u00AAn Hu\u00D5")
# Below we have a look at the wrongly formatted character string.
# This is what it would look like when you load TCVN3 encoded data as UTF8
string
# convert character vector from TCVN3 > UTF-8
decodeVN(string)
decodeVN(string, diacritics = FALSE)
# convert data frame columns from TCVN3 > UTF-8
df <- data.frame(id = c(1,2,3),
name = string)
df_decode <- decodeVN(df)
df_decode
# NOTE: some characters may be displayed as Unicode escapes (e.g. <U+1EA3>) in the R console
# check the individual column to see whether the characters are correct:
df_decode[,2]
decodeVN(df, diacritics = FALSE)
# using the built-in sample data
data(vn_samples)
decodeVN(vn_samples$TCVN3) # TCVN3 -> Unicode
decodeVN(vn_samples$TCVN3, diacritics = FALSE) # TCVN3 -> Unicode (ASCII characters only)
decodeVN(vn_samples$VISCII, from = "VISCII") # VISCII -> Unicode
# Demonstration for sf object
# create sf object (just for demonstration)
library(sf)
df_geom <- st_sfc(st_point(c(3,4)), st_point(c(10,11)), st_point(c(15,13)))
df_spatial <- st_set_geometry(df, df_geom)
# convert Vietnamese characters
df_spatial_decode <- decodeVN(df_spatial)
df_spatial_decode
df_spatial_decode$name