R: Automatically detect data models for CSV-files

detect_dm_csv {LaF}

R Documentation

Automatically detect data models for CSV-files

Description

Automatically detect the data model (column names and types) of a CSV file. The resulting data model can be used to open the file with laf_open.

Usage

detect_dm_csv(
  filename,
  sep = ",",
  dec = ".",
  header = FALSE,
  nrows = 1000,
  nlines = NULL,
  sample = FALSE,
  stringsAsFactors = TRUE,
  factor_fraction = 0.4,
  ...
)

Arguments

`filename`	character containing the filename of the csv-file.
`sep`	character vector containing the separator used in the file.
`dec`	the character used for decimal points.
`header`	logical indicating whether the first line in the file contains the column names.
`nrows`	the number of lines that should be read in to detect the column types. The more lines the more likely that the correct types are detected.
`nlines`	(only needed when the sample option is used) the expected number of lines in the file. If not specified the number of lines in the file is first calculated.
`sample`	by default the first `nrows` lines are read in for determining the column types. When `sample` is `TRUE`, random lines from the file are used instead. This is more robust, but takes longer.
`stringsAsFactors`	passed on to `read.table`. Set to `FALSE` to read all text columns as character. In that case `factor_fraction` is ignored.
`factor_fraction`	the fraction of unique strings in a column below which the column is converted to a factor/categorical. For more information see Details.
`...`	additional arguments are passed on to `read.table`. However, be careful with using these as some of these arguments are not supported by `laf_open_csv`.

Details

The argument factor_fraction determines the fraction of unique strings below which the column is converted to factor/categorical. If all columns need to be converted to character a value larger than one can be used. A value smaller than zero will ensure that all columns will be converted to categorical. Note that LaF stores the levels of a categorical in memory. Therefore, categorical columns with a very large number of (almost) unique levels can cause memory problems.
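The following is a minimal sketch of how the text-column handling can be inspected, not a definitive recipe. It assumes a small generated file with a text column that has only four distinct values in 100 rows; the file contents and the variable names (`testdata`, `classes_default`, `classes_chr`) are made up for illustration. It compares the default detection with `stringsAsFactors = FALSE` by reading the rows back through the LaF object and looking at the resulting column classes.

```r
library(LaF)

# Hypothetical test file: 100 rows, one text column with 4 distinct values
tmpcsv <- tempfile(fileext = ".csv")
n <- 100
testdata <- data.frame(
    id = 1:n,
    name = rep(c("jan", "pier", "tjores", "corneel"), length.out = n),
    stringsAsFactors = FALSE
)
# quote = FALSE keeps the text fields unquoted in the file
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = TRUE,
    sep = ",", quote = FALSE)

# Default detection (stringsAsFactors = TRUE, factor_fraction = 0.4)
model_default <- detect_dm_csv(tmpcsv, header = TRUE)
laf_default <- laf_open(model_default)
classes_default <- sapply(laf_default[1:n, ], class)
print(classes_default)

# Read all text columns as character; factor_fraction is then ignored
model_chr <- detect_dm_csv(tmpcsv, header = TRUE, stringsAsFactors = FALSE)
laf_chr <- laf_open(model_chr)
classes_chr <- sapply(laf_chr[1:n, ], class)
print(classes_chr)

# Cleanup
file.remove(tmpcsv)
```

Comparing the two printed class vectors shows whether the text column was detected as categorical or as a plain string under each setting, which is a quick check before opening a large file whose text columns may have many distinct values.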

Value

detect_dm_csv returns a data model which can be used by laf_open. The data model can be written to file using write_dm.
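As a sketch of how the returned data model might be stored and reused, the example below writes the model to a file with write_dm, reads it back with read_dm, and opens the CSV file from the restored model. The test data and the file names are made up for illustration, and the example assumes that the yaml package, which LaF uses for reading and writing data models, is installed.

```r
library(LaF)

# Hypothetical test file with an integer and a double column
tmpcsv <- tempfile(fileext = ".csv")
tmpmodel <- tempfile(fileext = ".yaml")
write.table(
    data.frame(id = 1:5, score = c(0.5, 1.5, 2.5, 3.5, 4.5)),
    file = tmpcsv, row.names = FALSE, col.names = TRUE,
    sep = ",", quote = FALSE
)

# Detect the data model once and store it next to the data
model <- detect_dm_csv(tmpcsv, header = TRUE)
write_dm(model, tmpmodel)

# Later: restore the model and open the file without detecting again
model2 <- read_dm(tmpmodel)
laf <- laf_open(model2)
dat <- laf[1:5, ]
print(dat)

# Cleanup
file.remove(tmpcsv, tmpmodel)
```

Storing the model avoids repeating the detection step for files that are opened often, and the stored file can be edited by hand when a column type was detected incorrectly.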

Examples

# Create temporary filename
tmpcsv  <- tempfile(fileext=".csv")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE),
    stringsAsFactors = FALSE
    )
# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=TRUE, sep=',')

# Detect data model
model <- detect_dm_csv(tmpcsv, header=TRUE)

# Create LaF-object
laf <- laf_open(model)

# Cleanup
file.remove(tmpcsv)

[Package LaF version 0.8.4 Index]