read_delim_arrow {arrow} | R Documentation |
Read a CSV or other delimited file with Arrow
Description
These functions use the Arrow C++ CSV reader to read into a tibble.
Arrow C++ options have been mapped to argument names that follow those of
readr::read_delim(), and col_select was inspired by vroom::vroom().
Usage
read_delim_arrow(
file,
delim = ",",
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL,
decimal_point = "."
)
read_csv_arrow(
file,
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
read_csv2_arrow(
file,
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
read_tsv_arrow(
file,
quote = "\"",
escape_double = TRUE,
escape_backslash = FALSE,
schema = NULL,
col_names = TRUE,
col_types = NULL,
col_select = NULL,
na = c("", "NA"),
quoted_na = TRUE,
skip_empty_rows = TRUE,
skip = 0L,
parse_options = NULL,
convert_options = NULL,
read_options = NULL,
as_data_frame = TRUE,
timestamp_parsers = NULL
)
Arguments
file |
A character file name or URI, connection, literal data (either a
single string or a raw vector), an Arrow input stream, or a FileSystem with path (SubTreeFileSystem). If a file name, a memory-mapped Arrow InputStream will be opened and closed when finished; compression will be detected from the file extension and handled automatically. If an input stream is provided, it will be left open. To be recognised as literal data, the input must be wrapped with I(). |
delim |
Single character used to separate fields within a record. |
quote |
Single character used to quote strings. |
escape_double |
Does the file escape quotes by doubling them?
That is, if this option is TRUE, the value """" represents a single quote, \". |
escape_backslash |
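Does the file use backslashes to escape special
characters? This is more general than escape_double, as backslashes can be used to escape the delimiter character, the quote character, or to add special characters like \n. |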
schema |
Schema that describes the table. If provided, it will be
used to satisfy both col_names and col_types. |
col_names |
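If TRUE, the first row of the input will be used as the column names and will not be included in the data frame. If FALSE, column names will be generated by Arrow, starting with "f0", "f1", ..., "fN". Alternatively, you can specify a character vector of column names. |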
col_types |
A compact string representation of the column types,
an Arrow Schema, or NULL (the default) to infer types from the data. |
col_select |
A character vector of column names to keep, as in the
"select" argument to data.table::fread(), or a tidy selection specification of columns, as used in dplyr::select(). |
na |
A character vector of strings to interpret as missing values. |
quoted_na |
Should missing values inside quotes be treated as missing
values (the default) or strings? (Note that this is different from the
Arrow C++ default for the corresponding convert option,
strings_can_be_null.) |
skip_empty_rows |
Should blank rows be ignored altogether? If
TRUE, blank rows will not be represented at all. If FALSE, they will be filled with missings. |
skip |
Number of lines to skip before reading data. |
parse_options |
see CSV parsing options.
If given, this overrides any
parsing options provided in other arguments (e.g. delim, quote, etc.). |
convert_options |
see CSV conversion options |
read_options |
see CSV reading options |
as_data_frame |
Should the function return a tibble (the default) or an Arrow Table? |
timestamp_parsers |
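User-defined timestamp parsers. If more than one parser is specified, the CSV conversion logic will try parsing values starting from the beginning of this vector. Possible values are: NULL (the default), which uses the ISO-8601 parser; a character vector of strptime parse strings; or a list of TimestampParser objects. |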
decimal_point |
Character to use for decimal point in floating point numbers. |
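As a rough illustration of how the na and col_select arguments combine, the sketch below reads a small made-up CSV supplied as literal data. The data, the column names and the choice of "-" as a missing-value marker are assumptions made for this example only.

```r
library(arrow)

# Hypothetical inline data; "-" stands in for a missing score.
csv_text <- "id,score,label\n1,-,a\n2,3.5,b\n3,4.0,c"

df <- read_csv_arrow(
  I(csv_text),                    # I() marks the string as literal data
  na = c("-", "NA"),              # strings to interpret as missing values
  col_select = c("id", "score")   # keep only these columns
)

names(df)
is.na(df$score)
```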
Details
read_csv_arrow() and read_tsv_arrow() are wrappers around
read_delim_arrow() that specify a delimiter. read_csv2_arrow() uses ; for
the delimiter and , for the decimal point.
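To make the relationship between the wrappers concrete, here is a minimal sketch, assuming a small made-up semicolon-delimited input, that reads the same data with read_csv2_arrow() and with read_delim_arrow() using an explicit delim and decimal_point. Under the description above, both calls would be expected to give the same result.

```r
library(arrow)

# Hypothetical data in "European" CSV style: ";" separates fields,
# "," is the decimal point.
csv2_text <- "x;y\n1,5;a\n2,25;b"

via_csv2 <- read_csv2_arrow(I(csv2_text))

via_delim <- read_delim_arrow(
  I(csv2_text),
  delim = ";",
  decimal_point = ","
)

# Compare the two results as plain data frames
identical(as.data.frame(via_csv2), as.data.frame(via_delim))
```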
Note that not all readr options are currently implemented here. Please file
an issue if you encounter one that arrow should support.
If you need to control Arrow-specific reader parameters that don't have an
equivalent in readr::read_csv(), you can either provide them in the
parse_options, convert_options, or read_options arguments, or you can
use CsvTableReader directly for lower-level access.
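One possible way to pass Arrow-specific parsing settings is sketched below. It assumes that CsvParseOptions$create() accepts a delimiter argument, and it uses made-up pipe-delimited data; the point is only that an explicit parse_options object takes the place of the parsing arguments the wrapper would otherwise set.

```r
library(arrow)

# Hypothetical pipe-delimited input supplied as literal data
pipe_text <- "x|y\n1|2\n3|4"

# Build the parse options directly rather than through `delim`
opts <- CsvParseOptions$create(delimiter = "|")

# parse_options overrides the "," delimiter that read_csv_arrow() implies
df <- read_csv_arrow(I(pipe_text), parse_options = opts)

names(df)
dim(df)
```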
Value
A tibble, or a Table if as_data_frame = FALSE.
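A short sketch of the two return types, using made-up inline data: with as_data_frame = FALSE the result stays an Arrow Table, which can be converted to a data frame later with as.data.frame().

```r
library(arrow)

# Hypothetical inline data
csv_text <- "x,y\n1,a\n2,b"

# Keep the result as an Arrow Table instead of converting to a tibble
tab <- read_csv_arrow(I(csv_text), as_data_frame = FALSE)
inherits(tab, "Table")

# Inspect the inferred column types without pulling the data into R
tab$schema

# Convert to an R data frame only when needed
df <- as.data.frame(tab)
```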
Specifying column types and names
By default, the CSV reader will infer the column names and data types from the file, but there are a few ways you can specify them directly.
One way is to provide an Arrow Schema in the schema argument,
which is an ordered map of column name to type.
When provided, it satisfies both the col_names and col_types arguments.
This is good if you know all of this information up front.
You can also pass a Schema to the col_types argument. If you do this,
column names will still be inferred from the file unless you also specify
col_names. In either case, the column names in the Schema must match the
data's column names, whether they are explicitly provided or inferred. That
said, this Schema does not have to reference all columns: those omitted
will have their types inferred.
Alternatively, you can declare column types by providing the compact string
representation that readr uses to the col_types argument. This means you
provide a single string, one character per column, where the characters map
to Arrow types analogously to the readr type mapping:
"c": utf8()
"i": int32()
"n": float64()
"d": float64()
"l": bool()
"f": dictionary()
"D": date32()
"t": time32() (The unit arg is set to the default value "ms")
"_": null()
"-": null()
"?": infer the type from the data
If you use the compact string representation for col_types, you must also
specify col_names.
Regardless of how types are specified, all columns with a null() type will
be dropped.
Note that if you are specifying column names, whether by schema or
col_names, and the CSV file has a header row that would otherwise be used
to identify column names, you'll need to add skip = 1 to skip that row.
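The rules above can be sketched together in one call. The example below assumes a small made-up file with a header row: the compact string "ic_" requests int32 for the first column, utf8 for the second, and null() for the third, so the third column is expected to be dropped. Because col_names is supplied, skip = 1 is added to skip the header row.

```r
library(arrow)

# Hypothetical inline data with a header row
csv_text <- "id,name,flag\n1,a,TRUE\n2,b,FALSE"

df <- read_csv_arrow(
  I(csv_text),
  col_types = "ic_",                      # int32, utf8, null() (dropped)
  col_names = c("id", "name", "flag"),    # required with a compact string
  skip = 1                                # skip the header row in the data
)

names(df)
sapply(df, class)
```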
Examples
tf <- tempfile()
on.exit(unlink(tf))
write.csv(mtcars, file = tf)
df <- read_csv_arrow(tf)
dim(df)
# Can select columns
df <- read_csv_arrow(tf, col_select = starts_with("d"))
# Specifying column types and names
write.csv(data.frame(x = c(1, 3), y = c(2, 4)), file = tf, row.names = FALSE)
read_csv_arrow(tf, schema = schema(x = int32(), y = utf8()), skip = 1)
read_csv_arrow(tf, col_types = schema(y = utf8()))
read_csv_arrow(tf, col_types = "ic", col_names = c("x", "y"), skip = 1)
# Note that if a timestamp column contains time zones,
# the string "T" `col_types` specification won't work.
# To parse timestamps with time zones, provide a Schema to `col_types`
# and specify the time zone in the type object:
tf <- tempfile()
write.csv(data.frame(x = "1970-01-01T12:00:00+12:00"), file = tf, row.names = FALSE)
read_csv_arrow(
tf,
col_types = schema(x = timestamp(unit = "us", timezone = "UTC"))
)
# Read directly from strings with `I()`
read_csv_arrow(I("x,y\n1,2\n3,4"))
read_delim_arrow(I(c("x y", "1 2", "3 4")), delim = " ")