R: Read, process each block and write the result

ffply {fplyr}

R Documentation

Read, process each block and write the result

Description

Suppose you want to process each block of a file, and the result for each block is again a data.table that you want to write to an output file. One possible approach is to call l <- flply(...), combine the list with do.call(rbind, l) and write the result with fwrite, but this would be slow. ffply offers a faster solution to this problem.
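The two routes described above can be sketched as follows. This is a minimal illustration, not a benchmark: it assumes that fplyr and data.table are installed, that flply accepts the input path and the function in the same order as ffply (minus the output path), and that the bundled dt_iris.csv file is tab-separated. The names block_mean, slow_out and fast_out are hypothetical.

```r
library(fplyr)
library(data.table)

f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")

# Hypothetical per-block function: column means of the current block.
block_mean <- function(d, by) d[, lapply(.SD, mean)]

# Route 1: collect every processed block in a list, bind, then write.
slow_out <- tempfile()
l <- flply(f1, block_mean)  # assumed call shape: flply(input, FUN)
fwrite(do.call(rbind, l), slow_out, sep = "\t")

# Route 2: let ffply process the blocks and write the output file itself.
fast_out <- tempfile()
ffply(f1, fast_out, block_mean)
```

Both routes produce an output file; the sketch does not claim that the two files are identical byte for byte.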

Usage

ffply(
  input,
  output = "",
  FUN,
  ...,
  key.sep = "\t",
  sep = "\t",
  skip = 0,
  header = TRUE,
  nblocks = Inf,
  stringsAsFactors = FALSE,
  colClasses = NULL,
  select = NULL,
  drop = NULL,
  col.names = NULL,
  parallel = 1
)

Arguments

`input`	Path of the input file.
`output`	String containing the path to the output file.
`FUN`	Function to be applied to each block. It must take at least two arguments, the first of which is a `data.table` containing the current block, without the first field; the second argument is a character vector containing the value of the first field for the current block.
`...`	Additional arguments to be passed to FUN.
`key.sep`	The character that delimits the first field from the rest.
`sep`	The field delimiter (often equal to `key.sep`).
`skip`	Number of lines to skip at the beginning of the file.
`header`	Whether the file has a header.
`nblocks`	The number of blocks to read.
`stringsAsFactors`	Whether to convert strings into factors.
`colClasses`	Vector or list specifying the class of each field.
`select`	The columns (names or numbers) to be read.
`drop`	The columns (names or numbers) not to be read.
`col.names`	Names of the columns.
`parallel`	Number of cores to use.
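
To make the FUN contract concrete, the following base R sketch emulates how a callback receives each block. It is an illustration under stated assumptions, not fplyr's implementation: it uses a small in-memory data.frame instead of a file and a data.table, and the names input, block_means and result are hypothetical.

```r
# Hypothetical stand-in for an input file: the first column is the key,
# and rows sharing a key are contiguous, forming one block.
input <- data.frame(
  Species = c("setosa", "setosa", "virginica", "virginica", "virginica"),
  Sepal.Length = c(5.1, 4.9, 6.3, 5.8, 7.1),
  Petal.Length = c(1.4, 1.4, 6.0, 5.1, 5.9)
)

# A FUN-style callback: `d` is the block without the first field,
# `by` is the value of the first field for that block.
block_means <- function(d, by) {
  data.frame(Key = by, as.list(colMeans(d)))
}

# Emulate the block-wise dispatch (illustration only; fplyr reads the
# blocks from a file rather than subsetting an in-memory table).
keys <- unique(input$Species)
results <- lapply(keys, function(k) {
  block <- input[input$Species == k, -1, drop = FALSE]
  block_means(block, k)
})
result <- do.call(rbind, results)
print(result)
```

The callback never sees the key column inside `d`; if the key should appear in the result, the callback has to add it from `by`, as block_means does here.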

Value

Returns NULL invisibly. As a side effect, writes the processed data.table to the output file.

Slogan

ffply: from file to file

Examples

f1 <- system.file("extdata", "dt_iris.csv", package = "fplyr")
f2 <- tempfile()

# Copy the first two blocks from f1 into f2 to obtain a shorter but
# consistent version of the original input file.
ffply(f1, f2, function(d, by) d, nblocks = 2)

# Compute the mean of the columns for each species
ffply(f1, f2, function(d, by) d[, lapply(.SD, mean)])

# Reshape the file, block by block
ffply(f1, f2, function(d, by) {
    val <- do.call(c, d)
    var <- rep(names(d), each = nrow(d))
    data.table(Var = var, Val = val)
})

[Package fplyr version 1.3.0 Index]