hdd_slice {hdd}    R Documentation

Applies a function to slices of data to create a HDD data set

Description

This function applies complex R functions to data sets that are too large to fit in memory. It slices the input data, applies the function to each slice, then saves each resulting chunk in a folder on the hard drive. That folder can then be used as a HDD data set.

Usage

hdd_slice(
  x,
  fun,
  dir,
  chunkMB = 500,
  rowsPerChunk,
  replace = FALSE,
  verbose = 1,
  ...
)

Arguments

x

A data set (data.frame, HDD).

fun

A function to be applied to slices of the data set. The function must return a data frame like object.

dir

The destination directory where the data is saved.

chunkMB

The size of the slices in MB; the default is 500. That is, the function fun is applied to each 500MB slice of x. If the function creates a lot of additional information, you may want to lower this number. Conversely, if the function reduces the information, you may want to raise it. The appropriate value ultimately depends on the amount of memory available.

rowsPerChunk

Integer, missing by default. Alternative to the argument chunkMB. If provided, the function fun is applied to chunks of rowsPerChunk rows of x.

replace

Whether all information in the destination directory should be erased beforehand. Default is FALSE.

verbose

Integer, defaults to 1. If greater than 0 then the progress is displayed.

...

Other parameters to be passed to fun.
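
To give a feel for how a memory budget such as chunkMB relates to a number of rows such as rowsPerChunk, here is a minimal base R sketch. The helper rows_for_budget is hypothetical and not part of the hdd package. It assumes that 1 MB is 1024^2 bytes and that the in-memory size reported by object.size is a reasonable proxy, which may differ from how hdd_slice actually sizes its slices.

```r
# Hypothetical helper, NOT part of the hdd package:
# estimates how many rows of x fit in a memory budget of chunkMB.
rows_for_budget = function(x, chunkMB = 500){
	# in-memory size of x, in MB (assumption: 1 MB = 1024^2 bytes)
	size_mb = as.numeric(object.size(x)) / 1024^2
	# average size of one row, in MB
	mb_per_row = size_mb / nrow(x)
	# at least one row per chunk, at most all the rows of x
	max(1, min(nrow(x), floor(chunkMB / mb_per_row)))
}

# a tiny budget leads to several slices of iris
n_small = rows_for_budget(iris, chunkMB = 0.001)
# the default budget holds the whole of iris in a single slice
n_large = rows_for_budget(iris, chunkMB = 500)
```

If fun inflates the data (as a cartesian merge does), a smaller budget keeps each result within memory; if fun shrinks the data, a larger budget reduces the number of slices.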

Details

This function splits the original data into several slices and then applies a function to each of them, saving the results into a HDD data set.

You can perform merging operations with hdd_slice, but for regular merges note that the function hdd_merge may prove more convenient (no need to write an ad hoc function).
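
The slice, apply, save pattern described above can be pictured with the following base R sketch. It is only an illustration of the idea: the function slice_apply_save, the use of saveRDS and the file naming are assumptions made for this example, and the internals and on-disk format of hdd_slice may well differ.

```r
# Hypothetical illustration in base R; NOT how hdd_slice is implemented.
slice_apply_save = function(x, fun, dir, rowsPerChunk, ...){
	dir.create(dir, showWarnings = FALSE, recursive = TRUE)
	# first row of each slice
	starts = seq(1, nrow(x), by = rowsPerChunk)
	for(i in seq_along(starts)){
		rows = starts[i]:min(starts[i] + rowsPerChunk - 1, nrow(x))
		# apply fun to the current slice only; extra arguments go through ...
		res = fun(x[rows, , drop = FALSE], ...)
		# one file per slice (the naming scheme is an arbitrary choice)
		saveRDS(res, file.path(dir, sprintf("slice_%03d.rds", i)))
	}
	invisible(NULL)
}

out_dir = tempfile()
keep_long = function(d) d[d$Sepal.Length > 5, ]
slice_apply_save(iris, keep_long, out_dir, rowsPerChunk = 30)

# reading the slices back and stacking them
files = list.files(out_dir, full.names = TRUE)
res_all = do.call(rbind, lapply(files, readRDS))
```

Only one slice and its result are held in memory at a time, which is what makes this pattern usable when the full result would not fit in memory.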

Value

This function does not return anything. The output is a "hard drive data" (HDD) data set saved in dir on the hard drive.

Author(s)

Laurent Berge

See Also

See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out of memory data. For importation of HDD data sets from text files: see txt2hdd.

See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.

To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.

To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.

Examples


# Toy example with iris data.
# Say you want to perform a cartesian merge.
# If the result of the function does not fit in memory,
# you can use hdd_slice (not the case for this example).

# preparing the cartesian merge
iris_bis = iris
names(iris_bis) = c(paste0("x_", 1:4), "species_bis")


fun_cartesian = function(x){
	# Note that x is treated as a data.table
	# => we need the argument allow.cartesian
	merge(x, iris_bis, allow.cartesian = TRUE)
}

hdd_result = tempfile() # => folder where results are saved
hdd_slice(iris, fun_cartesian, dir = hdd_result, rowsPerChunk = 30)

# Let's look at the result
base_hdd = hdd(hdd_result)
summary(base_hdd)
head(base_hdd)




[Package hdd version 0.1.1 Index]