txt2hdd {hdd} | R Documentation |
Transforms text data into a HDD file
Description
Imports text data and saves it into a HDD file. It uses read_delim_chunked
to extract the data. It also allows the data to be preprocessed before it is saved.
Usage
txt2hdd(
path,
dirDest,
chunkMB = 500,
rowsPerChunk,
col_names,
col_types,
nb_skip,
delim,
preprocessfun,
replace = FALSE,
encoding = "UTF-8",
verbose = 0,
locale = NULL,
...
)
Arguments
path |
Character vector that represents the path to the data. It can also be a pattern when multiple files sharing the same name pattern are to be imported (in that case it must be a fixed pattern, NOT a regular expression). |
dirDest |
The destination directory, where the new HDD data should be saved. |
chunkMB |
The chunk size in MB, defaults to 500MB. Instead of using this
argument, you can alternatively use the argument rowsPerChunk. |
rowsPerChunk |
Number of rows per chunk. By default it is missing: its value
is deduced from argument chunkMB. |
col_names |
The column names. By default it uses the ones of the data set. If the data set lacks column names, you must provide them. |
col_types |
The column types, in readr's style (see Details). |
nb_skip |
Number of lines to skip. |
delim |
The delimiter. By default the function tries to find the delimiter, but sometimes it fails. |
preprocessfun |
A function that is applied to the data before saving. Default is missing. Note that if a function is provided, it MUST return a data.frame; anything other than a data.frame is ignored. |
replace |
If the destination directory already exists, you need to set the
argument replace = TRUE. |
encoding |
Character scalar containing the encoding of the file to be read.
By default it is "UTF-8" and is passed to readr. Note that this argument is ignored if the argument locale is not NULL. |
verbose |
Logical scalar or |
locale |
Either |
... |
Other arguments to be passed to read_delim_chunked. |
Details
This function uses read_delim_chunked from readr to read a large text file chunk by chunk and generate a HDD data set.
Since the main function for importation uses readr, the column specification must also be in readr's style (namely cols or cols_only).
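As an illustration of what a readr-style column specification looks like, here is a minimal sketch for the columns of the iris data set. The object names spec_all and spec_some are only illustrative, and the commented call at the end assumes that iris_path and hdd_path have been defined as in the Examples section.

```r
library(readr)

# Explicit types for every column of the iris data
spec_all <- cols(
  Sepal.Length = col_double(),
  Sepal.Width  = col_double(),
  Petal.Length = col_double(),
  Petal.Width  = col_double(),
  Species      = col_character()
)

# Only two columns are declared: with cols_only, the other columns
# are not imported at all
spec_some <- cols_only(
  Sepal.Length = col_double(),
  Species      = col_character()
)

# Either object could then be supplied through `col_types`, e.g.:
# txt2hdd(iris_path, dirDest = hdd_path, col_types = spec_all)
```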
By default a guess of the column types is made on the first 10,000 rows. The guess is the application of guess_col_types on these rows.
Note that by default, columns that are found to be integers are imported as double (for want of an integer64 type in readr). For large data sets, integer-like identifiers can sometimes be longer than 16 digits: in these cases you must import them as character to avoid losing information.
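The loss of information mentioned above can be seen with base R alone. The sketch below uses two made-up 17-digit identifiers; the column name id in the final comment is hypothetical.

```r
# Two distinct 17-digit identifiers, stored as text
id_a <- "12345678901234567"
id_b <- "12345678901234568"

# As character they remain distinct
id_a == id_b

# Once converted to double they can no longer be told apart:
# doubles represent integers exactly only up to 2^53 (about 9.0e15)
as.numeric(id_a) == as.numeric(id_b)

# With a readr-style specification, such a column (here hypothetically
# named `id`) would be declared as character:
# col_types = readr::cols(id = readr::col_character())
```

Declaring only the identifier column leaves the remaining columns to be guessed, since cols() falls back to guessing for columns that are not named.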
The delimiter is found with the function guess_delim, which uses the guessing from fread. Note that fixed-width files are not supported.
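When the guess is unreliable, the delimiter can be given explicitly through the argument delim. The following is a minimal sketch, assuming a semicolon-separated copy of the iris data; the names csv_path and out_dir are only illustrative.

```r
library(hdd)

# A small semicolon-separated file written with base R
csv_path <- tempfile(fileext = ".csv")
write.table(iris, csv_path, sep = ";", row.names = FALSE, quote = FALSE)

# The delimiter is supplied explicitly instead of being guessed
out_dir <- tempfile()
txt2hdd(csv_path, dirDest = out_dir, delim = ";", rowsPerChunk = 50)

# Dimensions of the imported data
dim(hdd(out_dir))
```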
Value
This function does not return anything in R. Instead it creates a folder on disk containing .fst files. These files represent the data that has been imported and converted to the hdd format.
You can then read the created data with the function hdd().
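To make the on-disk result concrete, the sketch below imports the iris data and then inspects the destination folder. The names txt_path and out_dir are only illustrative, and the exact file names found in the folder are not assumed here.

```r
library(hdd)
library(data.table)  # for fwrite()

# Text file to import
txt_path <- tempfile()
fwrite(iris, txt_path)

# Import into a new folder
out_dir <- tempfile()
txt2hdd(txt_path, dirDest = out_dir, rowsPerChunk = 50)

# The destination folder now contains the .fst files
fst_files <- list.files(out_dir, pattern = "\\.fst$")
fst_files

# The folder is read back as a HDD object
base_hdd <- hdd(out_dir)
nrow(base_hdd)
```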
Author(s)
Laurent Berge
See Also
See hdd, sub-.hdd and cash-.hdd for the extraction and manipulation of out-of-memory data. For importation of HDD data sets from text files: see txt2hdd.
See hdd_slice to apply functions to chunks of data (and create HDD objects) and hdd_merge to merge large files.
To create/reshape HDD objects from memory or from other HDD objects, see write_hdd.
To display general information from HDD objects: origin, summary.hdd, print.hdd, dim.hdd and names.hdd.
Examples
# Toy example with iris data
library(hdd)
library(data.table) # for fwrite() and the data.table syntax used below
# we create a text file on disk
iris_path = tempfile()
fwrite(iris, iris_path)
# destination path
hdd_path = tempfile()
# reading the text file with HDD, with approx. 50 rows per chunk:
txt2hdd(iris_path, dirDest = hdd_path, rowsPerChunk = 50)
base_hdd = hdd(hdd_path)
summary(base_hdd)
# Same example with preprocessing
sl_keep = sort(unique(sample(iris$Sepal.Length, 40)))
fun = function(x){
  # we keep only some observations & vars + renaming
  res = x[Sepal.Length %in% sl_keep, .(sl = Sepal.Length, Species)]
  # we create some variables
  res[, sl2 := sl^2]
  res
}
# reading with preprocessing
hdd_path_preprocess = tempfile()
txt2hdd(iris_path, hdd_path_preprocess,
        preprocessfun = fun, rowsPerChunk = 50)
base_hdd_preprocess = hdd(hdd_path_preprocess)
summary(base_hdd_preprocess)