bread {bread}R Documentation

Reads a file in table format, selecting columns, subsetting rows by number and filtering them by column values

Description

Wrapper for data.table::fread() simplifying the use of Unix commands like 'grep', 'cut', 'awk' and 'sed' on a data file before loading it in memory. The Unix commands are automatically generated from the arguments. This is useful if you want to load a big file too large for your available memory (and encounter the 'cannot allocate vector of size' error) and know you can work on a subsample. 'b' stands for 'big file'. This function allows to subset rows by their index number, select columns and filter with a pattern.

Usage

bread(
  file = NULL,
  first_row = NULL,
  last_row = NULL,
  head = NULL,
  tail = NULL,
  colnames = NULL,
  colnums = NULL,
  patterns = NULL,
  filtered_columns = NULL,
  fixed = FALSE,
  range_min = NULL,
  range_max = NULL,
  numrange_columns = NULL,
  ...
)

Arguments

file

String. Name or full path to a file compatible with data.table::fread()

first_row

Numeric. First row of the portion of the file to subset.

last_row

Numeric. Last row of the portion of the file to subset.

head

Numeric. How many rows starting from the first in the file.

tail

Numeric. How many rows starting from the last in the file.

colnames

Vector of strings. Exact names of columns to select. If both colnames and colnums are provided, colnums will be prefered.

colnums

Vector of numeric. Columns index numbers.

patterns

Vector of strings. One or several patterns used to filter the data from the input file. Each element of the vector should correspond to the column to be filtered. Can use regular expressions.

filtered_columns

Vector of strings or numeric. Optional. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered.

fixed

Logical. If TRUE, pattern is a string to be matched as is. Overrides all conflicting arguments.

range_min

Vector of numeric. One or several minimal values used to filter (inclusively, as in superior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered.

range_max

Vector of numeric. One or several maximal values used to filter (inclusively, as in inferior OR EQUAL to that value) the data from the input file. Each element of the vector should correspond to the numrange_column to be filtered.

numrange_columns

Vector of strings or numeric. The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the range_min and range_man values with which it will be filtered.

...

Arguments that must be passed to data.table::fread() like 'sep' or 'dec'.

Details

You can mix and match the row subsetting, the filtering by value and the selecting of columns. In order, the function:

  1. subsets the rows by their numbers (with 'sed' & 'awk'). You need to input the index number of the first and last rows you want to load in memory with fread(), or alternatively use either the head or tail arguments to subset the first or last rows of the file.

  2. selects columns by index number or name (with 'cut'). If both colnames and colnums are provided, colnums will be prefered.

  3. filters the data selected so far with a pattern by column (with 'grep'). The columns to be filtered should be indicated through their names or their index number. Each element of the vector should correspond to the pattern with which it will be filtered.

  4. filters (inclusively, as in inferior/superior OR EQUAL) the data selected so far by numerical value on a different set of provided columns with the 'sed' command.

Value

A data frame with the selected columns and the subsetted and filtered data

Warning

Best practice would probably be to load the big file in a SQL database or something. Or not working on huge CSV files in the first place. But if you have to, you hopefully won"t have to delve into the fascinating grammar of Unix commands.

Examples

file <- system.file('extdata', 'test.csv', package = 'bread')
## Select the columns numbered 1 and 3
bread(file = file, colnums = c(1,3))
## Select the columns named 'YEAR' and 'PRICE', then filter to keep only the
## value '2022' in column 'YEAR'
bread(file = file, colnames = c('YEAR', 'PRICE'),
      patterns = 2002, filtered_columns = 'YEAR')
## Select the columns names 'YEAR' and 'PRICE', then filter to keep only values
## superior or equal to 2004 in YEAR and to 2000 in PRICE
bread(file = file, colnames = c("YEAR", "PRICE"),
      range_min = c(2004,2000), numrange_columns = c(1,3))
## Subset to keep only the rows 10 to 18, select the columns named 'YEAR'
## and 'COLOR' then filter to keep only the value 'red' in column 'COLOR'
bread(file = file, colnames = c('YEAR', 'COLOR'),
      patterns = 'red', filtered_columns = 'COLOR',
      first_row = 10, last_row = 18)

[Package bread version 0.4.1 Index]