detect_dm_csv {LaF}    R Documentation

Automatically detect data models for CSV-files

Description

Automatically detect data models for CSV-files. The detected data model can then be used to open the file with laf_open.

Usage

detect_dm_csv(
  filename,
  sep = ",",
  dec = ".",
  header = FALSE,
  nrows = 1000,
  nlines = NULL,
  sample = FALSE,
  stringsAsFactors = TRUE,
  factor_fraction = 0.4,
  ...
)

Arguments

filename

character containing the filename of the csv-file.

sep

character vector containing the separator used in the file.

dec

the character used for decimal points.

header

logical indicating whether the first line in the file contains the column names.

nrows

the number of lines that should be read in to detect the column types. The more lines are read, the more likely it is that the correct types are detected.

nlines

(only needed when the sample option is used) the expected number of lines in the file. If not specified, the number of lines in the file is determined first.

sample

by default the first nrows lines are read in for determining the column types. When sample is TRUE, random lines from the file are used instead. This is more robust, but takes longer.

stringsAsFactors

passed on to read.table. Set to FALSE to read all text columns as character. In that case factor_fraction is ignored.

factor_fraction

the fraction of unique strings in a column below which the column is converted to a factor/categorical. For more information see details.

...

additional arguments are passed on to read.table. However, be careful when using these, as some of them are not supported by laf_open_csv.
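
As an illustration of how sep, dec and header work together, the following sketch writes a small semicolon-separated file with decimal commas and detects its data model. The column names and values are made up for the example. It assumes, as described in the read_dm documentation, that the data model is a list whose columns element is a data.frame with the columns name and type.

```r
library(LaF)

# Hypothetical test data: semicolon-separated with decimal commas
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(
    id    = 1:20,
    price = seq(0.5, 10, by = 0.5) + 0.25
    )
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = TRUE,
    sep = ";", dec = ",", quote = FALSE)

# Tell detect_dm_csv about the separator, the decimal mark and the header line
model <- detect_dm_csv(tmpcsv, sep = ";", dec = ",", header = TRUE)

# Inspect the detected column names and types
print(model$columns)

# Open the file with the detected model and read the first five rows
laf <- laf_open(model)
first_rows <- next_block(laf, nrows = 5)
print(first_rows)

# Cleanup
file.remove(tmpcsv)
```

If sep or dec do not match the file, read.table will typically read the numeric column as text, so it is worth checking the printed column types before opening large files.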

Details

The argument factor_fraction determines the fraction of unique strings below which the column is converted to factor/categorical. If all columns need to be converted to character, a value larger than one can be used. A value smaller than zero will ensure that all columns are converted to categorical. Note that LaF stores the levels of a categorical in memory. Therefore, categorical columns with a very large number of (almost) unique levels can cause memory problems.
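
The effect of the threshold can be checked on a small file. The sketch below is only an illustration with made-up data: one text column has two unique values (fraction 0.002) and the other has a unique value on every line (fraction 1). It assumes that the detected types are available in the columns element of the data model, and it compares the default factor_fraction with stringsAsFactors = FALSE.

```r
library(LaF)

# Hypothetical test data with 1000 rows (the default value of nrows)
tmpcsv <- tempfile(fileext = ".csv")
n <- 1000
testdata <- data.frame(
    group   = rep(c("a", "b"), length.out = n),  # 2 unique strings
    id_code = sprintf("code%04d", 1:n),          # n unique strings
    stringsAsFactors = FALSE
    )
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = TRUE, sep = ",")

# Default factor_fraction of 0.4: only columns with few unique strings
# are expected to become categorical
model_default <- detect_dm_csv(tmpcsv, header = TRUE)
print(model_default$columns)

# stringsAsFactors = FALSE: all text columns are read as character
# and factor_fraction is ignored
model_strings <- detect_dm_csv(tmpcsv, header = TRUE, stringsAsFactors = FALSE)
print(model_strings$columns)

# Cleanup
file.remove(tmpcsv)
```

Keeping a column such as id_code as a string avoids storing one level per line in memory, which is the memory problem mentioned above.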

Value

detect_dm_csv returns a data model which can be used by laf_open. The data model can be written to file using write_dm.
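
A possible round trip through a model file could look as follows. This is a sketch under a few assumptions: the test data and file names are invented, write_dm is called as write_dm(model, modelfile), and the yaml package that LaF uses for model files is installed.

```r
library(LaF)

# Hypothetical test data
tmpcsv <- tempfile(fileext = ".csv")
testdata <- data.frame(id = 1:10, value = (1:10) + 0.5)
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = TRUE, sep = ",")

# Detect the data model and store it in a file
model <- detect_dm_csv(tmpcsv, header = TRUE)
modelfile <- tempfile(fileext = ".yaml")
write_dm(model, modelfile)

# Later: read the stored model and use it to open the CSV file
model_from_file <- read_dm(modelfile)
laf <- laf_open(model_from_file)
block <- next_block(laf, nrows = 10)
print(block)

# Cleanup
file.remove(tmpcsv, modelfile)
```

Storing the model avoids running the detection again each time the same file is opened.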

See Also

See write_dm to write the data model to file. The data models can be used to open a file using laf_open.

Examples

# Create temporary filename
tmpcsv  <- tempfile(fileext=".csv")

# Generate test data
ntest <- 10
column_types <- c("integer", "integer", "double", "string")
testdata <- data.frame(
    a = 1:ntest,
    b = sample(1:2, ntest, replace=TRUE),
    c = round(runif(ntest), 13),
    d = sample(c("jan", "pier", "tjores", "corneel"), ntest, replace=TRUE),
    stringsAsFactors = FALSE
    )
# Write test data to csv file
write.table(testdata, file=tmpcsv, row.names=FALSE, col.names=TRUE, sep=',')

# Detect data model
model <- detect_dm_csv(tmpcsv, header=TRUE)

# Create LaF-object
laf <- laf_open(model)

# Cleanup
file.remove(tmpcsv)
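
The sample and nlines arguments can be tried in the same way. The following sketch uses made-up data without a header line and detects the column types from 50 randomly chosen lines. The number of lines is passed through nlines so that it does not have to be counted first. The types are read from the columns element of the data model, which is assumed to be a data.frame with the columns name and type.

```r
library(LaF)

# Hypothetical test data: 200 lines, no header
tmpcsv <- tempfile(fileext = ".csv")
n <- 200
testdata <- data.frame(a = 1:n, b = (1:n) + 0.5)
write.table(testdata, file = tmpcsv, row.names = FALSE, col.names = FALSE, sep = ",")

# Detect the column types from 50 random lines instead of the first 50
model <- detect_dm_csv(tmpcsv, header = FALSE, nrows = 50, sample = TRUE,
    nlines = n)
print(model$columns)

# Cleanup
file.remove(tmpcsv)
```

Sampling is mainly useful when the first lines of a file are not representative, for example when a column only contains non-integer values further down.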


[Package LaF version 0.8.4 Index]