decodeVN {vietnameseConverter}    R Documentation

Convert characters from legacy Vietnamese encodings to UTF-8 encoding

Description

Converts text stored in legacy Vietnamese encodings (TCVN3, VISCII, VPS) to Unicode characters in UTF-8 encoding, with or without diacritics.

Usage

decodeVN(
  x,
  from = c("TCVN3", "VISCII", "VPS", "Unicode"),
  to = c("Unicode", "TCVN3", "VISCII", "VPS"),
  diacritics = TRUE
)

Arguments

x

data.frame, sf object, or character vector

from

Text encoding of input x

to

Text encoding of output

diacritics

logical. Preserve diacritics (TRUE) or not (FALSE)?

Details

Many characters in legacy Vietnamese encodings (e.g. TCVN3, VPS, VISCII) are not read correctly in R, particularly those with diacritics (accents). R does not appear to support these encodings, at least in many locales. When R reads such text as if it were UTF-8 encoded, it prints wrong characters and the text appears garbled (mojibake; see the vignette and the examples below).
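As a rough illustration of how such garbling can arise, the sketch below builds the TCVN3 bytes of one word by hand and labels them as Latin-1. The Latin-1 interpretation is an assumption made for this demonstration (it matches the code points used in the Examples section); the byte values are derived from those example strings, not read from a real file.

```r
# Bytes of "Quang" (with a hook above the "a") in TCVN3: Q, u, 0xB6, n, g
tcvn3_bytes <- as.raw(c(0x51, 0x75, 0xB6, 0x6E, 0x67))

# Assumption for this sketch: the bytes are interpreted as Latin-1
garbled <- rawToChar(tcvn3_bytes)
Encoding(garbled) <- "latin1"
garbled <- enc2utf8(garbled)

# Byte 0xB6 now appears as the pilcrow sign (U+00B6) instead of the
# intended Vietnamese letter
garbled
utf8ToInt(garbled)
```

The resulting string is the same kind of input that decodeVN expects, as in the first element of the example vector in the Examples section.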

This function converts character vectors from various Vietnamese legacy encodings to readable Unicode characters in UTF-8 encoding. By default the function attempts the conversion from TCVN3 to Unicode while preserving the diacritics, but it also supports other Vietnamese encodings (VPS, VISCII - via argument from). Currently VNI and VNU are not supported.

It works on data frames, spatial objects (from the sf package), and character vectors.

diacritics = TRUE will return characters with their diacritics. With diacritics = FALSE, the output will be ASCII letters without diacritics. Upper/lower case will be preserved regardless.

The internal search and replace is performed by the gsubfn function from the gsubfn package. It performs simple character replacements to fix the text.
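The idea of a character-by-character replacement can be sketched in a few lines of base R. This is not the package's implementation: gsub from base R is swapped in for gsubfn to keep the sketch free of dependencies, replace_chars is a hypothetical helper, and the lookup table is only the small subset of TCVN3 characters that occur in the example strings of this page.

```r
# Illustrative subset of a TCVN3 -> Unicode table (the real table is much larger)
legacy  <- c("\u00B6", "\u00DE", "\u00A7", "\u00AB", "\u00F5", "\u00AA", "\u00D5")
unicode <- c("\u1EA3", "\u1ECB", "\u0110", "\u00F4", "\u1EEB", "\u00EA", "\u1EBF")
ascii   <- c("a",      "i",      "D",      "o",      "u",      "e",      "e")

# Hypothetical helper (not part of vietnameseConverter):
# replace each character in 'old' by the matching entry in 'new'
replace_chars <- function(x, old, new) {
  stopifnot(length(old) == length(new))
  for (i in seq_along(old)) {
    x <- gsub(old[i], new[i], x, fixed = TRUE)
  }
  x
}

string <- c("Qu\u00B6ng Tr\u00DE", "An \u00A7\u00ABn", "Th\u00F5a Thi\u00AAn Hu\u00D5")

replace_chars(string, legacy, unicode)  # roughly what diacritics = TRUE aims for
replace_chars(string, legacy, ascii)    # roughly what diacritics = FALSE aims for
```

Because this loop applies the replacements one after another, it is only safe when no replacement character is itself a key in the table, which holds for this subset. A single-pass replacement avoids that restriction for a full conversion table.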

Currently the function converts from the Vietnamese encodings to Unicode, not vice versa. Please contact the maintainer if the conversion from Unicode to Vietnamese encodings would be relevant for you.

The character conversion table was adapted from http://vietunicode.sourceforge.net/charset/.

Value

Character vector or data frame (depending on x)

Warning

When printing a data frame with Unicode characters using the standard print method, the R console may show Unicode escape sequences (e.g. "<U+1EA3>") instead of the actual Unicode characters. This is a limitation of the R console. The data are correct and will display correctly when using e.g. View() or when printing columns as vectors.

Examples

   # First we produce the wrongly formatted character string
   # using Unicode symbols is only necessary to create a portable example in the R package
   # you don't need to use Unicode characters like this in your data

   string <- c("Qu\u00B6ng Tr\u00DE", "An \u00A7\u00ABn", "Th\u00F5a Thi\u00AAn Hu\u00D5")

   # Below we have a look at the wrongly formatted character string.
   # This is what it would look like when you load TCVN3 encoded data as UTF-8
   string

   # convert character vector from TCVN3 > UTF-8
   decodeVN(string)
   decodeVN(string, diacritics = FALSE)

   # convert data frame columns from TCVN3 > UTF-8
   df <- data.frame(id = c(1,2,3),
                   name  = string)

   df_decode <- decodeVN(df)
   df_decode
   # NOTE: some characters may be displayed as Unicode escapes (e.g. <U+1EA3>) in the R console
   # print the individual column as a vector to check that the characters are correct:
   df_decode[,2]

   decodeVN(df, diacritics = FALSE)

   # using the built-in sample data
   data(vn_samples)
   decodeVN(vn_samples$TCVN3)                       # TCVN3 -> Unicode
   decodeVN(vn_samples$TCVN3, diacritics = FALSE)   # TCVN3 -> Unicode (ASCII characters only)
   decodeVN(vn_samples$VISCII, from = "VISCII")     # VISCII -> Unicode


   # Demonstration for sf object

   # create sf object (just for demonstration)
   require(sf)
   df_geom <- st_sfc(st_point(c(3,4)), st_point(c(10,11)), st_point(c(15,13)))
   df_spatial <- st_set_geometry(df, df_geom)

   # convert Vietnamese characters
   df_spatial_decode <- decodeVN(df_spatial)

   df_spatial_decode
   df_spatial_decode$name



[Package vietnameseConverter version 0.4.0 Index]