csv_to_disk.frame {disk.frame} | R Documentation |
Convert CSV file(s) to disk.frame format
Description
Convert CSV file(s) to disk.frame format
Usage
csv_to_disk.frame(
infile,
outdir = tempfile(fileext = ".df"),
inmapfn = base::I,
nchunks = recommend_nchunks(sum(file.size(infile))),
in_chunk_size = NULL,
shardby = NULL,
compress = 50,
overwrite = TRUE,
header = TRUE,
.progress = TRUE,
backend = c("data.table", "readr", "LaF"),
chunk_reader = c("bigreadr", "data.table", "readr", "readLines"),
...
)
Arguments
infile |
The input CSV file or files |
outdir |
The directory to output the disk.frame to |
inmapfn |
A function applied to each chunk read from the CSV before the chunk is written out. Commonly used to perform simple transformations. Defaults to the identity function (i.e. no transformation) |
nchunks |
Number of chunks to output |
in_chunk_size |
How many lines to read at once when reading the file. This is different from 'nchunks', which controls how many chunks are output |
shardby |
The column(s) to shard the data by. For example, if 'shardby = c("col1","col2")', then all rows that share the same values of 'col1' and 'col2' end up in the same chunk; this makes merging by 'col1' and 'col2' more efficient |
compress |
For fst backends, a number between 0 and 100, where 100 gives the highest compression ratio. |
overwrite |
Whether to overwrite the existing directory |
header |
Whether the files have a header row. Defaults to TRUE |
.progress |
A logical, for whether or not to show progress |
backend |
The CSV reader backend to use: "data.table" or "readr" (the Usage signature also accepts "LaF"). disk.frame does not have its own CSV reader; it uses either data.table::fread or readr::read_delim. Note that data.table::fread does not detect dates, so all dates are imported as strings; you are encouraged to use fasttime to convert the strings to dates, which can be done in 'inmapfn'. If you want automatic date detection, backend = "readr" may suit your needs. However, readr is often slower than data.table, hence data.table is the default. |
chunk_reader |
Even after choosing a backend, there can be several strategies for reading the CSV. For example, data.table::fread tries to mmap the whole file, which can cause the whole read to fail. In that case, set chunk_reader to "readLines", which uses the readLines function to read the file chunk by chunk while still using data.table::fread to parse each chunk. There are currently no alternative strategies for the readr backend beyond the default. |
... |
Passed to data.table::fread, disk.frame::as.disk.frame, disk.frame::shard |
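The 'backend' description above suggests converting date strings inside 'inmapfn'. The following is a minimal sketch of that idea, not the package's own recipe: the input file, the column name 'day', and the helper 'parse_day' are hypothetical, and base::as.Date is used in place of fasttime to keep the sketch free of extra dependencies.

```r
library(disk.frame)

# Hypothetical input: a small CSV with an ISO-formatted date column named `day`.
csv_path <- tempfile(fileext = ".csv")
events <- data.frame(
  day = c("2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"),
  value = c(1.5, 2.5, 3.5, 4.5)
)
write.csv(events, csv_path, row.names = FALSE)

# Hypothetical transformation: applied to the data after it is read from the
# CSV and before it is written out. base::as.Date stands in for fasttime here.
parse_day <- function(chunk) {
  chunk$day <- as.Date(chunk$day)
  chunk
}

events_df <- csv_to_disk.frame(
  csv_path,
  outdir = tempfile(fileext = ".df"),
  inmapfn = parse_day,
  nchunks = 2,
  overwrite = TRUE
)

# Bring the data back into memory and inspect the class of the converted column.
events_tbl <- dplyr::collect(events_df)
class(events_tbl$day)

# clean up
fs::file_delete(csv_path)
delete(events_df)
```

Because the conversion happens in 'inmapfn', the column is stored on disk already converted, so later reads do not need to repeat the parsing step.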
See Also
Other ingesting data:
zip_to_disk.frame()
Examples
tmpfile = tempfile()
write.csv(cars, tmpfile)
tmpdf = tempfile(fileext = ".df")
df = csv_to_disk.frame(tmpfile, outdir = tmpdf, overwrite = TRUE)
# clean up
fs::file_delete(tmpfile)
delete(df)
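A further sketch, illustrating the 'shardby' argument under stated assumptions: it uses the built-in iris data, shards by 'Species', and then reads the chunk files directly with fst::read_fst to check which species each chunk holds. Listing the '.fst' files in 'outdir' is an assumption about the on-disk layout rather than a documented interface.

```r
library(disk.frame)

# Write a small CSV to ingest; row.names = FALSE avoids an extra index column.
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

out_path <- tempfile(fileext = ".df")
iris_df <- csv_to_disk.frame(
  csv_path,
  outdir = out_path,
  nchunks = 3,
  shardby = "Species",
  overwrite = TRUE
)

# Assumption: each chunk is stored as an .fst file directly inside `outdir`.
chunk_files <- list.files(out_path, pattern = "\\.fst$", full.names = TRUE)

# Collect the distinct Species values found in each chunk file.
species_per_chunk <- lapply(chunk_files, function(f) {
  unique(as.character(fst::read_fst(f)$Species))
})
all_species <- unlist(species_per_chunk)

# If sharding worked as described, no species appears in more than one chunk,
# so this vector contains no duplicates.
anyDuplicated(all_species) == 0

# Total number of rows across all chunks.
total_rows <- nrow(iris_df)

# clean up
fs::file_delete(csv_path)
delete(iris_df)
```

Several species may still share a chunk; the guarantee described for 'shardby' is only that rows with the same key are never split across chunks.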