read_resource {frictionless}    R Documentation

Read data from a Data Resource into a tibble data frame
Description
Reads data from a Data Resource (in a Data Package) into a tibble (a Tidyverse data frame).
The resource must be a Tabular Data Resource.
The function uses readr::read_delim() to read CSV files, passing the resource properties path, CSV dialect, column names, data types, etc.
Column names are taken from the provided Table Schema (schema), not from the header in the CSV file(s).
Usage
read_resource(package, resource_name, col_select = NULL)
Arguments
package
Data Package object, created with read_package().

resource_name
Name of the Data Resource.

col_select
Character vector of the columns to include in the result, in the order provided. Selecting columns can improve read speed.
Value
A tibble() data frame with the Data Resource's tabular data.
If there are parsing problems, a warning will alert you.
You can retrieve the full details by calling problems() on your data frame.
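For example (a minimal sketch using the example Data Package shipped with frictionless, as in the Examples section below):

# Read the example Data Package and one of its resources
package <- read_package(
  system.file("extdata", "datapackage.json", package = "frictionless")
)
observations <- read_resource(package, "observations")
# Retrieve the full details of any parsing problems reported in a warning
readr::problems(observations)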
Resource properties
The Data Resource properties are handled as follows:
Path
path is required.
It can be a local path or URL, which must resolve.
Absolute paths (/) and relative parent paths (../) are forbidden to avoid security vulnerabilities.
When multiple paths are provided ("path": ["myfile1.csv", "myfile2.csv"]), the data are merged into a single data frame, in the order in which the paths are listed.
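For example, the "observations" resource in the example Data Package lists two CSV files in its path, which are returned as one merged data frame (a sketch, assuming package was read as shown under Value above and in the Examples below):

# Two files are listed in the resource's path...
package$resources[[2]]$path
# ...and read_resource() merges them into a single tibble
read_resource(package, "observations")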
Data
If path is not present, the function will attempt to read data from the data property.
schema will be ignored.
Name
name is required.
It is used to find the resource with name = resource_name.
Profile
profile is required to have the value tabular-data-resource.
File encoding
encoding (e.g. windows-1252) is required if the resource file(s) are not encoded as UTF-8.
The returned data frame will always be UTF-8.
CSV Dialect
dialect is required if the resource file(s) deviate from the default CSV settings (see below).
It can either be a JSON object or a path or URL referencing a JSON object.
Only deviating properties need to be specified, e.g. a tab-delimited file without a header row needs:

"dialect": {"delimiter": "\t", "header": false}
These are the CSV dialect properties. Some are ignored by the function:

- delimiter: default ,.
- lineTerminator: ignored, line terminator characters LF and CRLF are interpreted automatically by readr::read_delim(), while CR (used by Classic Mac OS, final release 2001) is not supported.
- doubleQuote: default true.
- quoteChar: default ".
- escapeChar: anything but \ is ignored and it will set doubleQuote to false as these fields are mutually exclusive. You can thus not escape with \" and "" in the same file.
- nullSequence: ignored, use missingValues.
- skipInitialSpace: default false.
- header: default true.
- commentChar: not set by default.
- caseSensitiveHeader: ignored, header is not used for column names, see Table schema properties.
- csvddfVersion: ignored.
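After reading a Data Package you can inspect a resource's dialect directly; this is only a sketch (assuming package was read as shown under Value above), and the property is NULL when the resource uses the default CSV settings:

# Dialect of the second resource, NULL when only default CSV settings apply
package$resources[[2]]$dialect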
File compression
Resource file(s) with path ending in .gz, .bz2, .xz, or .zip are automatically decompressed using default readr::read_delim() functionality.
Only .gz files can be read directly from URL paths.
Only the extension in path can be used to indicate compression type; the compression property is ignored.
Ignored resource properties
- title
- description
- format
- mediatype
- bytes
- hash
- sources
- licenses
Table schema properties
schema is required and must follow the Table Schema specification.
It can either be a JSON object or a path or URL referencing a JSON object.

- Field names are used as column headers.
- Field types are used as column types (see further).
- Values listed in missingValues are interpreted as NA, with "" as the default.
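A short sketch of inspecting the schema that drives column names and types (assuming package was read as shown under Value above; the same inspection appears in the Examples below):

schema <- package$resources[[2]]$schema
# Field names become column headers
purrr::map_chr(schema$fields, "name")
# Values listed here are read as NA (NULL when not set, in which case only "" applies)
schema$missingValues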
Field types
Field type is used to set the column type, as follows:

- string as character; or factor when enum is present. format is ignored.
- number as double; or factor when enum is present. Use bareNumber: false to ignore whitespace and non-numeric characters. decimalChar (. by default) and groupChar (undefined by default) can be defined, but the most frequently occurring value will be used as a global value for all number fields of that resource.
- integer as double (not integer, to avoid issues with big numbers); or factor when enum is present. Use bareNumber: false to ignore whitespace and non-numeric characters.
- boolean as logical. Non-default trueValues/falseValues are not supported.
- object as character.
- array as character.
- date as date. Supports format, with values default (ISO date), any (guess ymd) and Python/C strptime patterns, such as %a, %d %B %Y for Sat, 23 November 2013. %x is %m/%d/%y. %j, %U, %w and %W are not supported.
- time as hms::hms(). Supports format, with values default (ISO time), any (guess hms) and Python/C strptime patterns, such as %I%p%M:%S.%f%z for 8AM30:00.300+0200.
- datetime as POSIXct. Supports format, with values default (ISO datetime), any (ISO datetime) and the same patterns as for date and time. %c is not supported.
- year as date, with 01 for month and day.
- yearmonth as date, with 01 for day.
- duration as character. Can be parsed afterwards with lubridate::duration().
- geopoint as character.
- geojson as character.
- any as character.

Any other value is not allowed.
Type is guessed if not provided.
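For example, column classes follow the schema field types listed above, and a duration field can be parsed after reading (a minimal sketch, assuming package was read as shown under Value above; the ISO 8601 value below is a made-up literal, not taken from the example data):

deployments <- read_resource(package, "deployments")
# Column classes follow the schema field types
str(deployments)
# A duration field is returned as character, e.g. "PT2H30M",
# and can be parsed afterwards:
lubridate::duration("PT2H30M")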
See Also
Other read functions: read_package(), resources()
Examples
# Read a datapackage.json file
package <- read_package(
system.file("extdata", "datapackage.json", package = "frictionless")
)
package
# Read data from the resource "observations"
read_resource(package, "observations")
# The above tibble is merged from 2 files listed in the resource path
package$resources[[2]]$path
# The column names and types are derived from the resource schema
purrr::map_chr(package$resources[[2]]$schema$fields, "name")
purrr::map_chr(package$resources[[2]]$schema$fields, "type")
# Read data from the resource "deployments" with column selection
read_resource(package, "deployments", col_select = c("latitude", "longitude"))