dataset_factory {arrow}    R Documentation

Create a DatasetFactory


A Dataset can be constructed using one or more DatasetFactory objects. This function helps you construct a DatasetFactory that you can pass to open_dataset().


dataset_factory(
  x,
  filesystem = NULL,
  format = c("parquet", "arrow", "ipc", "feather", "csv", "tsv", "text"),
  partitioning = NULL,
  ...
)



A string path to a directory containing data files, a vector of one or more string paths to data files, or a list of DatasetFactory objects whose datasets should be combined. If a list of DatasetFactory objects is supplied, it is used to construct a UnionDatasetFactory and the other arguments are ignored.


A FileSystem object; if omitted, the FileSystem will be detected from x.


A FileFormat object, or a string identifier of the format of the files in x. Currently supported values:

  • "parquet"

  • "ipc"/"arrow"/"feather", all aliases for each other; for Feather, note that only version 2 files are supported

  • "csv"/"text", aliases for the same thing (because comma is the default delimiter for text files)

  • "tsv", equivalent to passing format = "text", delimiter = "\t"

Default is "parquet", unless a delimiter is also specified, in which case it is assumed to be "text".
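As an illustration of the format aliases above, the following is a minimal sketch rather than an official example. The scratch directory name and the sample data are hypothetical, and the file is written with write_feather(), which is assumed to produce a Feather version 2 file by default.

```r
library(arrow)

# Hypothetical scratch directory containing one Feather version 2 file
feather_dir <- file.path(tempdir(), "factory_feather_demo")
dir.create(feather_dir, showWarnings = FALSE)
write_feather(
  data.frame(name = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = FALSE),
  file.path(feather_dir, "part-0.feather")
)

# "feather" and "ipc" are aliases, so both factories read the same file format
via_feather <- open_dataset(list(dataset_factory(feather_dir, format = "feather")))
via_ipc <- open_dataset(list(dataset_factory(feather_dir, format = "ipc")))

# Both datasets should report the same column names
names(via_feather)
names(via_ipc)
```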


One of

  • A Schema, in which case the file paths relative to x will be parsed, and path segments will be matched with the schema fields. For example, schema(year = int16(), month = int8()) would create partitions for file paths like "2019/01/file.parquet", "2019/02/file.parquet", etc.

  • A character vector that defines the field names corresponding to those path segments (that is, you're providing the names that would correspond to a Schema but the types will be autodetected)

  • A HivePartitioning or HivePartitioningFactory, as returned by hive_partition(), which parses explicit or autodetected fields from Hive-style path segments

  • NULL for no partitioning
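The first two partitioning options can be sketched as follows. This is an illustrative example under stated assumptions: the directory layout mirrors the "2019/01/file.parquet" paths mentioned above, and the root directory name and the value column are hypothetical.

```r
library(arrow)

# Hypothetical layout: <root>/2019/01/file.parquet and <root>/2019/02/file.parquet
root <- file.path(tempdir(), "factory_partition_demo")
for (month in c("01", "02")) {
  dir.create(file.path(root, "2019", month), recursive = TRUE, showWarnings = FALSE)
  write_parquet(
    data.frame(value = c(1.5, 2.5)),
    file.path(root, "2019", month, "file.parquet")
  )
}

# Option 1: a Schema gives both the names and the types of the path segments
typed_factory <- dataset_factory(
  root,
  partitioning = schema(year = int16(), month = int8())
)
typed_ds <- open_dataset(list(typed_factory))

# Option 2: a character vector gives only the names; types are autodetected
named_factory <- dataset_factory(root, partitioning = c("year", "month"))
named_ds <- open_dataset(list(named_factory))

# The partition fields appear as columns alongside the file's own columns
names(typed_ds)
names(named_ds)
```

With the Schema form, the year field keeps the int16 type that was requested; with the character vector form, the type is whatever arrow infers from the path segments.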


Additional format-specific options, passed to FileFormat$create(). For CSV options, note that you can specify them either with the Arrow C++ library naming ("delimiter", "quoting", etc.) or the readr-style naming used in read_csv_arrow() ("delim", "quote", etc.). Not all readr options are currently supported; please file an issue if you encounter one that arrow should support.
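The two naming styles for CSV options can be compared with a small sketch. The scratch directory and the semicolon-delimited file are hypothetical; the example assumes that "delimiter" (Arrow C++ naming) and "delim" (readr-style naming) are interchangeable here, as described above.

```r
library(arrow)

# Hypothetical scratch directory with one semicolon-delimited text file
delim_dir <- file.path(tempdir(), "factory_delim_demo")
dir.create(delim_dir, showWarnings = FALSE)
write.table(
  data.frame(name = c("a", "b"), value = c(1.5, 2.5)),
  file.path(delim_dir, "part-0.csv"),
  sep = ";", row.names = FALSE, quote = FALSE
)

# Arrow C++ library naming
cpp_style <- dataset_factory(delim_dir, format = "text", delimiter = ";")
# readr-style naming, as used in read_csv_arrow()
readr_style <- dataset_factory(delim_dir, format = "text", delim = ";")

cpp_ds <- open_dataset(list(cpp_style))
readr_ds <- open_dataset(list(readr_style))

# Both datasets should expose the same two columns
names(cpp_ds)
names(readr_ds)
```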


If you only need a single DatasetFactory (for example, you have a single directory containing Parquet files), you can call open_dataset() directly. Use dataset_factory() when you want to combine different directories, file systems, or file formats.
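A minimal sketch of the combining case, assuming two hypothetical scratch directories that hold the same columns in different formats. The column types are chosen (string and double) so that the Parquet and CSV schemas are expected to agree; mismatched types between sources may fail to combine.

```r
library(arrow)

# Hypothetical directories: one with Parquet files, one with CSV files
parquet_dir <- file.path(tempdir(), "factory_union_parquet")
csv_dir <- file.path(tempdir(), "factory_union_csv")
dir.create(parquet_dir, showWarnings = FALSE)
dir.create(csv_dir, showWarnings = FALSE)

df <- data.frame(name = c("a", "b"), value = c(1.5, 2.5), stringsAsFactors = FALSE)
write_parquet(df, file.path(parquet_dir, "part-0.parquet"))
write.csv(df, file.path(csv_dir, "part-0.csv"), row.names = FALSE)

# One factory per directory, each with its own format
parquet_factory <- dataset_factory(parquet_dir, format = "parquet")
csv_factory <- dataset_factory(csv_dir, format = "csv")

# Passing the factories in a list combines them into a single Dataset
combined <- open_dataset(list(parquet_factory, csv_factory))

# Materialize the combined data to confirm rows from both sources are present
combined_tbl <- Scanner$create(combined)$ToTable()
nrow(combined_tbl)
```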


A DatasetFactory object. Pass this to open_dataset(), in a list potentially with other DatasetFactory objects, to create a Dataset.

[Package arrow version 4.0.1 Index]