FileFormat {arrow} | R Documentation |
Dataset file formats
Description
A FileFormat holds information about how to read and parse the files included in a Dataset. There are subclasses corresponding to the supported file formats (ParquetFileFormat and IpcFileFormat).
Factory
FileFormat$create() takes the following arguments:
- format: A string identifier of the file format. Currently supported values:
    - "parquet"
    - "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported
    - "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)
    - "tsv", equivalent to passing format = "text", delimiter = "\t"
- ...: Additional format-specific options
    format = "parquet":
    - dict_columns: Names of columns which should be read as dictionaries.
    - Any Parquet options from FragmentScanOptions.
    format = "text": see CsvParseOptions. Note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support. Also, the following options are supported.
    From CsvReadOptions:
    - skip_rows
    - column_names. Note that if a Schema is specified, column_names must match those specified in the schema.
    - autogenerate_column_names
    From CsvFragmentScanOptions (these values can be overridden at scan time):
    - convert_options: a CsvConvertOptions
    - block_size
It returns the appropriate subclass of FileFormat (e.g. ParquetFileFormat).
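As a minimal sketch of how the format identifier maps to the returned subclass, the snippet below creates a few formats and checks their classes. It assumes the arrow package is installed with dataset and Parquet support; the variable names (parquet_format, ipc_formats) are illustrative only.

```r
library(arrow)

# "parquet" should give back a ParquetFileFormat
parquet_format <- FileFormat$create("parquet")
inherits(parquet_format, "ParquetFileFormat")

# "ipc", "arrow", and "feather" are aliases, so each should give an IpcFileFormat
ipc_formats <- lapply(c("ipc", "arrow", "feather"), function(f) FileFormat$create(f))
all(vapply(ipc_formats, inherits, logical(1), what = "IpcFileFormat"))

# Whatever the subclass, the result also inherits from FileFormat
inherits(parquet_format, "FileFormat")
```

Checking with inherits() rather than comparing class(x)[1] keeps the check valid even if the class hierarchy gains intermediate classes.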
Examples
## Semi-colon delimited files
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
# recursive = TRUE is needed for unlink() to remove a directory
on.exit(unlink(tf, recursive = TRUE))
write.table(mtcars, file.path(tf, "file1.txt"), sep = ";", row.names = FALSE)
# Create FileFormat object
format <- FileFormat$create(format = "text", delimiter = ";")
open_dataset(tf, format = format)
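
A further sketch, under the same assumptions, shows the "tsv" shorthand next to the readr-style option naming described in the Factory section. The directory and object names (tsv_dir, tsv_format, delim_format) are hypothetical and exist only for this illustration.

```r
library(arrow)

## Tab-delimited files
# Hypothetical scratch directory holding one tab-delimited file
tsv_dir <- tempfile()
dir.create(tsv_dir)
write.table(mtcars, file.path(tsv_dir, "file1.tsv"), sep = "\t", row.names = FALSE)

# "tsv" is shorthand for format = "text", delimiter = "\t"
tsv_format <- FileFormat$create(format = "tsv")
ds_tsv <- open_dataset(tsv_dir, format = tsv_format)

# The same format, spelled with the readr-style option name "delim"
delim_format <- FileFormat$create(format = "text", delim = "\t")
ds_delim <- open_dataset(tsv_dir, format = delim_format)

# Both datasets should report the column names of mtcars
names(ds_tsv)
names(ds_delim)

# Clean up the scratch directory
unlink(tsv_dir, recursive = TRUE)
```

Either spelling should describe the same parse options; which to use is mostly a matter of whether the surrounding code follows Arrow C++ or readr conventions.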