R: Read a fst file by chunks

import_fst_chunked {tidyfst}

R Documentation

Read a fst file by chunks

Description

For 'import_fst_chunked', if a large fst file which could not be imported into the memory all at once, this function could read the fst file by chunks and preprocessed the chunk to ensure the results yielded by the chunks are small enough to be summarised in the end. For 'get_fst_chunk_size', this function can measure the memory used by a specified row number.

Usage

import_fst_chunked(
  path,
  chunk_size = 10000L,
  chunk_f = identity,
  combine_f = rbindlist
)

get_fst_chunk_size(path, nrows)

Arguments

`path`	Path to fst file
`chunk_size`	Integer. The number of rows to include in each chunk
`chunk_f`	A function implemented on every chunk.
`combine_f`	A function to aggregate all the elements from the list of results from chunks.
`nrows`	Number of rows to test.

Value

For 'import_fst_chunked', default to the whole data.frame in data.table. Could be adjusted to any type. For 'get_fst_chunk_size', return the file size.

Examples


## Not run: 
  # Generate some random data frame with 10 million rows and various column types
  nr_of_rows <- 1e7
  df <- data.frame(
    Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),
    Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
    Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
    Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
  )

  # Write the file to disk
  fst_file <- tempfile(fileext = ".fst")
  write_fst(df, fst_file)

  # Get the size of 10000 rows
  get_fst_chunk_size(fst_file,1e4)

  # File all rows that Integer == 7 by chunks
  import_fst_chunked(fst_file,chunk_f = \(x) x[Integer==7])


## End(Not run)