open_dataset {duckdbfs} | R Documentation |
Open a dataset from a variety of sources
Description
This function opens a dataset from a variety of sources, including Parquet and CSV files, using local file system paths, URLs, or S3 bucket URI notation.
Usage
open_dataset(
sources,
schema = NULL,
hive_style = TRUE,
unify_schemas = FALSE,
format = c("parquet", "csv", "tsv", "sf"),
conn = cached_connection(),
tblname = tmp_tbl_name(),
mode = "VIEW",
filename = FALSE,
recursive = TRUE,
...
)
Arguments
sources: A character vector of paths to the dataset files.

schema: The schema for the dataset. If NULL, the schema is inferred from the dataset files.

hive_style: A logical value indicating whether the dataset uses Hive-style partitioning.

unify_schemas: A logical value indicating whether to unify the schemas of the dataset files (union_by_name). If TRUE, executes a UNION by column name across all files. (NOTE: this can add considerably to the initial execution time.)

format: The format of the dataset files. One of "parquet", "csv", "tsv", or "sf".

conn: A connection to a database.

tblname: The name of the table to create in the database.

mode: The mode to create the table in. One of "VIEW" or "TABLE".

filename: A logical value indicating whether to include the filename in the table name.

recursive: Should we assume a recursive path? Default TRUE. Set to FALSE when opening a single, un-partitioned file.

...: Optional additional arguments passed to duckdb_s3_config().
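For instance, a single local CSV file can be opened by setting format and turning off recursive path handling. This is a minimal sketch assuming the duckdbfs package is installed; the temporary file written here is purely illustrative:

```r
library(duckdbfs)

# Write a throwaway CSV so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# recursive = FALSE because this is a single, un-partitioned file,
# not a Hive-partitioned directory tree.
cars <- open_dataset(csv_path, format = "csv", recursive = FALSE)
```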
Value
A lazy dplyr::tbl object representing the opened dataset, backed by a duckdb SQL connection. Most dplyr (and some tidyr) verbs can be used directly on this object, as they are translated into SQL commands automatically via dbplyr. Generic R commands require calling dplyr::collect() on the table, which forces evaluation and reads the resulting data into memory.
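As a sketch of that lazy workflow (assuming the duckdbfs and dplyr packages are installed; write_dataset() is used only to create a local Parquet file to read back), verbs run inside DuckDB until collect() pulls the result into R:

```r
library(duckdbfs)
library(dplyr)

# Create a small local Parquet file to open.
path <- tempfile(fileext = ".parquet")
write_dataset(mtcars, path)

open_dataset(path) |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg)) |>  # translated to SQL (AVG + GROUP BY) by dbplyr
  collect()                          # forces evaluation; result becomes an in-memory tibble
```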
Examples
# A remote, hive-partitioned Parquet dataset
base <- paste0("https://github.com/duckdb/duckdb/raw/main/",
"data/parquet-testing/hive-partitioning/union_by_name/")
f1 <- paste0(base, "x=1/f1.parquet")
f2 <- paste0(base, "x=1/f2.parquet")
f3 <- paste0(base, "x=2/f2.parquet")
open_dataset(c(f1, f2, f3), unify_schemas = TRUE)
# Access an S3 database, specifying an independently hosted (MinIO) endpoint
efi <- open_dataset("s3://neon4cast-scores/parquet/aquatics",
s3_access_key_id="",
s3_endpoint="data.ecoforecast.org")