sampleCSV {DMwR2} | R Documentation |
Drawing a random sample of lines from a CSV file
Description
Function for obtaining a random sample of lines from a very large CSV
file without having to load the full data into memory. It targets
situations where the full data does not fit in the computer's memory, so
the standard sample function cannot be used.
Usage
sampleCSV(file, percORn, nrLines, header=TRUE, mxPerc=0.5)
Arguments
file |
A file name (a string) |
percORn |
Either the percentage of the file's rows or the actual number of rows that the sample should have |
nrLines |
Optionally, the number of rows of the file, if you know it beforehand; otherwise the function will count them for you |
header |
Whether the file has a header line or not (a Boolean value) |
mxPerc |
A maximum threshold for the percentage the sample is allowed to have (defaults to 0.5) |
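As a rough illustration of how these arguments fit together, the sketch below writes a small throwaway CSV file and samples from it. The file, its columns, and the wrapper name `try_sample` are made up for the example. It assumes a unix-like system with `perl` available and the DMwR2 package installed, and it assumes that a `percORn` value below 1 is read as a fraction of the rows (which the 0.5 default of `mxPerc` suggests). Where those prerequisites are missing, the wrapper simply returns NULL.

```r
# Throwaway CSV used only for illustration (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:1000, x = rnorm(1000)), csv_path, row.names = FALSE)

# Hypothetical wrapper: returns NULL when the prerequisites are missing
# or when the call fails for any reason.
try_sample <- function(...) {
  ok <- .Platform$OS.type == "unix" &&
    nzchar(Sys.which("perl")) &&
    requireNamespace("DMwR2", quietly = TRUE)
  if (!ok) return(NULL)
  tryCatch(DMwR2::sampleCSV(...), error = function(e) NULL)
}

# Ask for roughly 10% of the rows; the function counts the lines itself.
smp_frac <- try_sample(csv_path, 0.1)

# Ask for 50 rows and pass the line count to skip the counting step
# (1001 is assumed here to include the header line).
smp_n <- try_sample(csv_path, 50, nrLines = 1001)
```

Passing `nrLines` only pays off on genuinely large files, where counting the lines is itself a noticeable cost.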
Details
This function can be used to draw a random sample of lines from a very
large CSV file. This is particularly useful when you cannot afford
to load the file into memory to use R functions like sample to
obtain the sample.
The function obtains the sample of rows without actually loading the full data into memory - only the final sample is loaded into main memory.
The function relies on unix utility programs (perl and wc), so
it is limited to unix-like platforms. It will not run on
other platforms (it checks the system variable .Platform$OS.type), although you may wish to inspect the function code and
see if you can adapt it to your platform.
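For platforms without these utilities, one possible starting point is a pure R approach. The sketch below is not the method used by sampleCSV: it swaps in reservoir sampling over a file connection read in chunks, so that at most `n` sampled lines plus one chunk are held in memory. The function name `sample_csv_lines` and its `chunk` argument are hypothetical, and the sketch assumes a plain line-oriented CSV with no quoted fields that span several lines.

```r
# Hypothetical portable alternative (not the DMwR2 implementation):
# reservoir sampling of n data lines from a CSV file, read chunk by chunk.
sample_csv_lines <- function(file, n, header = TRUE, chunk = 10000L) {
  con <- file(file, open = "r")
  on.exit(close(con))

  # Keep the header line aside so it is never part of the sample.
  hdr <- if (header) readLines(con, n = 1L) else character(0)

  reservoir <- character(0)
  seen <- 0  # numeric, to avoid integer overflow on very long files

  repeat {
    lines <- readLines(con, n = chunk)
    if (length(lines) == 0L) break
    for (ln in lines) {
      seen <- seen + 1
      if (seen <= n) {
        # Fill the reservoir with the first n lines.
        reservoir[seen] <- ln
      } else {
        # Replace a random slot with probability n / seen.
        j <- sample.int(seen, 1L)
        if (j <= n) reservoir[j] <- ln
      }
    }
  }

  # Only the sampled lines are parsed into a data frame.
  read.csv(text = c(hdr, reservoir), header = header)
}

# Usage on a small throwaway file (hypothetical data).
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:500, y = runif(500)), tmp, row.names = FALSE)
s <- sample_csv_lines(tmp, 20)
```

Unlike a request expressed as a percentage, this sketch always takes an explicit row count, and the sampled rows are not returned in file order. The line-by-line loop is slow in R, so it trades speed for portability.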
Value
A data frame
Author(s)
Luis Torgo ltorgo@dcc.fc.up.pt
References
Torgo, L. (2016) Data Mining with R: learning with case studies, second edition, Chapman & Hall/CRC (ISBN-13: 978-1482234893).
See Also
nrLinesFile, sample, sampleDBMS