UTF8filepaths {base} | R Documentation |
File Paths not in the Native Encoding
Description
Most modern file systems store file-path components (names of directories and files) in a character encoding of wide scope: usually UTF-8 on a Unix-alike and UCS-2/UTF-16 on Windows. However, this was not true when R was first developed and there are still exceptions amongst file systems, e.g. FAT32.
The C and POSIX standards did not anticipate this: they only provide means to access files via file paths encoded in the current locale, for example those specified in Latin-1 in a Latin-1 locale.
Everything here apart from the specific section on Windows is about Unix-alikes.
Details
It is possible to mark character strings (elements of character vectors) as being in UTF-8 or Latin-1 (see Encoding). This allows file paths not in the native encoding to be expressed in R character vectors but there is almost no way to use them unless they can be translated to the native encoding. That is of course not a problem if that is UTF-8, so these details are really only relevant to the use of a non-UTF-8 locale (including a C locale) on a Unix-alike.
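As a minimal sketch of the marking described above (the string and the variable names x, y and z are illustrative, not part of this help page), a byte string can be declared as Latin-1 and then converted to UTF-8. What enc2native returns depends on the session's locale, so no output is shown for it.

```r
# A string containing the single byte 0xE7 (c-cedilla in Latin-1).
x <- "gar\xE7on"

# Declare how the bytes should be interpreted; this does not change them.
Encoding(x) <- "latin1"

# Convert to UTF-8; non-ASCII results are marked as "UTF-8".
y <- enc2utf8(x)
Encoding(y)

# Translate to the native encoding of the current session.
# In a UTF-8 locale this leaves the string unchanged.
z <- enc2native(y)
```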
Functions to open a file such as file, fifo, pipe, gzfile, bzfile, xzfile and unz give an error for non-native filepaths. Where functions look at existence such as file.exists, dir.exists, unlink, file.info and list.files, non-native filepaths are treated as non-existent.
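One way to avoid being surprised by these errors is to check beforehand whether a path can be represented in the native encoding. The helper below, can_translate_to_native, is a hypothetical name introduced here for illustration; it relies on iconv returning NA for inputs it cannot convert when sub = NA. This is only a sketch: iconv implementations differ, and some substitute characters rather than fail.

```r
# Hypothetical helper (not part of base R): TRUE where a path can be
# converted from UTF-8 to the encoding of the current locale.
can_translate_to_native <- function(paths) {
  utf8 <- enc2utf8(paths)
  # to = "" means the current locale's encoding; with sub = NA,
  # inputs that cannot be converted are returned as NA.
  !is.na(iconv(utf8, from = "UTF-8", to = "", sub = NA))
}

paths <- c("notes.txt", "gar\u00e7on.txt")
ok <- can_translate_to_native(paths)

# Only query the file system for paths that passed the check.
exists_ok <- file.exists(paths[ok])
```

In a UTF-8 locale every valid path passes the check, which matches the remark above that these details mainly concern non-UTF-8 locales.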
Many other functions use file or gzfile to open their files.
file.path allows non-native file paths to be combined, marking them as UTF-8 if needed.
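A short illustration of combining components, assuming a directory name written with a Unicode escape (the names dir_name and p are made up for this example). The encoding mark of the result is inspected rather than stated, since it can depend on the R version and locale.

```r
# A directory name containing a non-ASCII character (e-acute).
dir_name <- "donn\u00e9es"

# Combine it with an ASCII file name.
p <- file.path(dir_name, "report.txt")

# Inspect the declared encoding of the combined path.
Encoding(p)
```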
path.expand only handles paths in the native encoding.
Windows
Windows provides proprietary entry points to access its file systems, and these gained ‘wide’ versions in Windows NT that allowed file paths in UCS-2/UTF-16 to be accessed from any locale.
Some R functions use these entry points when file paths are marked as Latin-1 or UTF-8 to allow access to paths not in the current encoding. These include file, file.access, file.append, file.copy, file.create, file.exists, file.info, file.link, file.remove, file.rename, file.symlink and dir.create, dir.exists, normalizePath, path.expand, pipe, Sys.glob, Sys.junction, unlink but not gzfile, bzfile, xzfile nor unz.
For functions using gzfile (including load, readRDS, read.dcf and tar), it is often possible to use a gzcon connection wrapping a file connection.
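The workaround can be sketched as follows for readRDS. The example uses an ASCII path under tempdir() so that it runs anywhere; on Windows the same pattern would be applied to a path marked as UTF-8 or Latin-1. The object and variable names are illustrative.

```r
# Write a small gzip-compressed RDS file to a temporary location.
original <- data.frame(id = 1:3, label = c("a", "b", "c"))
path <- file.path(tempdir(), "example.rds")
saveRDS(original, path)

# Open the file with file() (which can use the wide entry points on
# Windows), then wrap it in gzcon() to decompress on the fly.
con <- gzcon(file(path, open = "rb"))
restored <- readRDS(con)
close(con)  # closing the gzcon connection also closes the wrapped one
```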
Other notable exceptions are list.files, list.dirs, system and file-path inputs for graphics devices.
Historical comment
Before R 4.0.0, file paths marked as being in Latin-1 or UTF-8 were silently translated to the native encoding using escapes such as ‘<e7>’ or ‘<U+00e7>’. This created valid file names, but not necessarily the ones intended.
Note
This document is still a work-in-progress.