write_parquet {arrow} | R Documentation |
Write Parquet file to disk
Description
Parquet is a columnar storage file format. This function enables you to write Parquet files from R.
Usage
write_parquet(
  x,
  sink,
  chunk_size = NULL,
  version = "2.4",
  compression = default_parquet_compression(),
  compression_level = NULL,
  use_dictionary = NULL,
  write_statistics = NULL,
  data_page_size = NULL,
  use_deprecated_int96_timestamps = FALSE,
  coerce_timestamps = NULL,
  allow_truncated_timestamps = FALSE
)
Arguments
x |
|
sink |
A string file path, connection, URI, or OutputStream, or path in a file
system ( |
chunk_size |
how many rows of data to write to disk at once. This
directly corresponds to how many rows will be in each row group in
parquet. If |
version |
parquet version: "1.0", "2.0" (deprecated), "2.4" (default), "2.6", or "latest" (currently equivalent to 2.6). Numeric values are coerced to character. |
compression |
compression algorithm. Default "snappy". See details. |
compression_level |
compression level. Meaning depends on compression algorithm |
use_dictionary |
logical: use dictionary encoding? Default |
write_statistics |
logical: include statistics? Default |
data_page_size |
Set a target threshold for the approximate encoded size of data pages within a column chunk (in bytes). Default 1 MiB. |
use_deprecated_int96_timestamps |
logical: write timestamps to INT96 Parquet format, which has been deprecated? Default FALSE. |
coerce_timestamps |
Cast timestamps to a particular resolution. Can be
|
allow_truncated_timestamps |
logical: Allow loss of data when coercing timestamps to a particular resolution. E.g. if microsecond or nanosecond data is lost when coercing to "ms", do not raise an exception. Default FALSE. |
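The relationship between chunk_size and row groups described above can be checked with a small sketch. It assumes the arrow package is installed; the data frame and temporary file are illustrative only, and ParquetFileReader is used here just to inspect the result.

```r
library(arrow)

# Illustrative data: 5 rows written with at most 2 rows per row group
tf <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:5), tf, chunk_size = 2)

# Open the file with the lower-level reader and count its row groups
reader <- ParquetFileReader$create(tf)
n_groups <- reader$num_row_groups
print(n_groups)
```

With 5 rows and chunk_size = 2, the rows should be split into row groups of 2, 2 and 1 rows.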
Details
Due to features of the format, Parquet files cannot be appended to. If you want to use the Parquet format but also want the ability to extend your dataset, you can write to additional Parquet files and then treat the whole directory of files as a Dataset you can query. See the dataset article for examples of this.
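As a minimal sketch of the approach described above, the following writes two separate Parquet files into one directory and reads them back as a single Dataset. It assumes arrow was built with dataset support (see arrow_with_dataset()); the directory and file names are illustrative.

```r
library(arrow)

# A fresh directory that will hold several Parquet files (hypothetical names)
dir <- tempfile("parquet_dataset_")
dir.create(dir)

# "Extend" the data by writing an additional file rather than appending
write_parquet(data.frame(x = 1:3), file.path(dir, "part-1.parquet"))
write_parquet(data.frame(x = 4:6), file.path(dir, "part-2.parquet"))

# Treat the whole directory as one Dataset and materialize it
ds <- open_dataset(dir)
combined <- as.data.frame(Scanner$create(ds)$ToTable())
print(nrow(combined))
```

Row order across files is not something this sketch relies on; sort the result if order matters.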
The parameters compression, compression_level, use_dictionary and write_statistics support various patterns:
The default NULL leaves the parameter unspecified, and the C++ library uses an appropriate default for each column (defaults listed above).
A single, unnamed value (e.g. a single string for compression) applies to all columns.
An unnamed vector, of the same size as the number of columns, to specify a value for each column, in positional order.
A named vector, to specify the value for the named columns; the default value for the setting is used when not supplied.
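The patterns above might look as follows for use_dictionary. This is a sketch assuming the arrow package is installed; the data frame and its column names (id, label) are made up for illustration.

```r
library(arrow)

# Illustrative two-column data frame
df <- data.frame(
  id = 1:5,
  label = c("a", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# Single unnamed value: applies to all columns
tf_all <- tempfile(fileext = ".parquet")
write_parquet(df, tf_all, use_dictionary = FALSE)

# Unnamed vector, one value per column, in positional order (id, label)
tf_pos <- tempfile(fileext = ".parquet")
write_parquet(df, tf_pos, use_dictionary = c(FALSE, TRUE))

# Named vector: values are matched to columns by name
tf_named <- tempfile(fileext = ".parquet")
write_parquet(df, tf_named, use_dictionary = c(label = TRUE, id = FALSE))

# The encoding choice does not change the data that is read back
round_trip <- read_parquet(tf_named)
```

The same shapes apply to compression, compression_level and write_statistics, according to the description above.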
The compression argument can be any of the following (case-insensitive): "uncompressed", "snappy", "gzip", "brotli", "zstd", "lz4", "lzo" or "bz2". Only "uncompressed" is guaranteed to be available, but "snappy" and "gzip" are almost always included. See codec_is_available().
The default "snappy" is used if available, otherwise "uncompressed". To disable compression, set compression = "uncompressed". Note that "uncompressed" columns may still have dictionary encoding.
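Because codec availability depends on how arrow was built, one possible pattern is to pick the first available codec from a preference list and fall back to "uncompressed". The preference order below is an arbitrary assumption, not a recommendation from the package.

```r
library(arrow)

# Hypothetical preference order; adjust to your needs
candidates <- c("zstd", "snappy", "gzip")

# Keep only the codecs this arrow build supports
available <- candidates[vapply(candidates, codec_is_available, logical(1))]

# Fall back to "uncompressed", the only codec guaranteed to be available
codec <- if (length(available) > 0) available[[1]] else "uncompressed"

tf <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:5), tf, compression = codec)
```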
Value
The input x, invisibly.
See Also
ParquetFileWriter for a lower-level interface to Parquet writing.
Examples
tf1 <- tempfile(fileext = ".parquet")
write_parquet(data.frame(x = 1:5), tf1)
# using compression
if (codec_is_available("gzip")) {
  tf2 <- tempfile(fileext = ".gz.parquet")
  write_parquet(data.frame(x = 1:5), tf2, compression = "gzip", compression_level = 5)
}