hdd_slice {hdd} | R Documentation |
Applies a function to slices of data to create a HDD data set
Description
This function is useful for applying complex R functions to data sets that are too large to fit in memory. It slices the input data, applies the function to each slice, then saves each resulting chunk to a folder on the hard drive. That folder can then be used as a HDD data set.
Usage
hdd_slice(
x,
fun,
dir,
chunkMB = 500,
rowsPerChunk,
replace = FALSE,
verbose = 1,
...
)
Arguments
x |
A data set (data.frame, HDD). |
fun |
A function to be applied to slices of the data set. The function must return a data frame-like object. |
dir |
The destination directory where the data is saved. |
chunkMB |
The size of the slices in MB, default is 500. That is: the function fun is applied to slices of x of roughly this size. |
rowsPerChunk |
Integer, default is missing. Alternative to the argument chunkMB. |
replace |
Whether all information on the destination directory should be erased beforehand. Default is FALSE. |
verbose |
Integer, defaults to 1. If greater than 0 then the progress is displayed. |
... |
Other parameters to be passed to fun. |
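To give a feel for how a size in MB relates to a number of rows, here is a minimal base R sketch. The helper rows_for_chunk is hypothetical and is not part of the hdd package; it only assumes that rows have a roughly constant memory footprint, and it may differ from how hdd_slice actually sizes its slices.

```r
# Hypothetical helper (NOT from the hdd package): estimate how many rows
# of `x` fit in a slice of `chunkMB` megabytes, assuming every row takes
# about the same amount of memory.
rows_for_chunk <- function(x, chunkMB) {
  bytes_per_row <- as.numeric(object.size(x)) / nrow(x)
  # Always return at least one row per slice
  max(1L, as.integer(floor(chunkMB * 1024^2 / bytes_per_row)))
}

rows_small <- rows_for_chunk(iris, chunkMB = 0.001)
rows_large <- rows_for_chunk(iris, chunkMB = 1)
```

A function that inflates the data (for instance a cartesian merge) would typically call for smaller slices than this estimate suggests.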
Details
This function splits the original data into several slices and then applies a function to each of them, saving the results into a HDD data set.
You can perform merging operations with hdd_slice, but for regular merges note that the function hdd_merge may prove more convenient (no need to write an ad hoc function).
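The slice, apply and save cycle described above can be illustrated with base R alone. This is a conceptual sketch, not the package's implementation: slice_apply_save is a hypothetical helper, and it stores each result as an .rds file, which is an assumption made for the illustration and not the HDD storage format.

```r
# Conceptual sketch only: slice_apply_save() is a hypothetical helper,
# not part of the hdd package.
slice_apply_save <- function(x, fun, dir, rowsPerChunk) {
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  n <- nrow(x)
  starts <- seq(1, n, by = rowsPerChunk)
  for (i in seq_along(starts)) {
    # Row indices of the current slice
    rows <- starts[i]:min(starts[i] + rowsPerChunk - 1, n)
    res <- fun(x[rows, , drop = FALSE])
    # One file per slice: only one slice's result is held in memory at a time
    saveRDS(res, file.path(dir, sprintf("slice_%03d.rds", i)))
  }
  invisible(NULL)
}

out_dir <- tempfile()
keep_long <- function(d) d[d$Sepal.Length > 5, ]
slice_apply_save(iris, keep_long, out_dir, rowsPerChunk = 30)

# Reading the pieces back (fine here because the toy result is small)
files <- list.files(out_dir, full.names = TRUE)
combined <- do.call(rbind, lapply(files, readRDS))
```

With real out of memory data you would not bind all pieces back together; the point of saving slice by slice is that the full result never has to sit in memory at once.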
Value
This function does not return anything. Its output is a HDD ("hard drive data") data set saved on the hard drive.
Author(s)
Laurent Berge
See Also
See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out of memory data. For importation of HDD data sets from text files: see txt2hdd.
See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.
To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.
To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.
Examples
# Toy example with iris data.
# Say you want to perform a cartesian merge.
# If the result of the function does not fit in memory,
# you can use hdd_slice (not the case for this example).
# Preparing the cartesian merge
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")
fun_cartesian = function(x){
  # Note that x is treated as a data.table
  # => we need the argument allow.cartesian
  merge(x, iris_bis, allow.cartesian = TRUE)
}
hdd_result = tempfile() # => folder where results are saved
hdd_slice(iris, fun_cartesian, dir = hdd_result, rowsPerChunk = 30)
# Let's look at the result
base_hdd = hdd(hdd_result)
summary(base_hdd)
head(base_hdd)
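As a further illustration, the same pattern can be used with a simpler function that filters each slice. This is a sketch under assumptions: keep_long_sepals and dir_filtered are names introduced here, and the code is guarded so that it only runs when the hdd package is installed.

```r
# Sketch: filter rows slice by slice. keep_long_sepals and dir_filtered
# are illustrative names, not part of the hdd package.
if (requireNamespace("hdd", quietly = TRUE)) {
  library(hdd)

  # Each slice is passed to the function; here we keep rows with long sepals
  keep_long_sepals = function(x) x[x$Sepal.Length > 5, ]

  dir_filtered = tempfile() # => folder where results are saved
  hdd_slice(iris, keep_long_sepals, dir = dir_filtered,
            rowsPerChunk = 50, verbose = 0)

  # The saved folder can be read back as a HDD data set
  filtered_hdd = hdd(dir_filtered)
  dim(filtered_hdd)
}
```

Here rowsPerChunk = 50 splits iris into three slices of 50 rows each, and verbose = 0 silences the progress display.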