write.sparse {readsparse}    R Documentation

Write Sparse Matrix in Text Format

Description

Write a labelled sparse matrix in the text format used by software such as SVMLight, LibSVM, ThunderSVM, LibFM, xLearn, XGBoost, LightGBM, and others, in which each line has the form:

<label(s)> <column:value> <column:value> ...

For more information about the format and usage examples, see read.sparse.

Can write labels for regression, classification (binary, multi-class, and multi-label), and ranking (with 'qid'), but note that most software that makes use of this data format supports only regression and binary classification.
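
As an illustration of the line layout above, here is a minimal base-R sketch that formats a single observation with one label. The helper 'format_row' is hypothetical and is not part of 'readsparse'; it only mimics the general shape of the output (1-based column indices, fixed number of decimal places) and is not the package's implementation.

```r
# Hypothetical helper (NOT part of 'readsparse'): formats one observation
# as "<label> <column:value> <column:value> ..." with 1-based columns.
format_row <- function(label, cols, vals, decimal_places = 8L) {
  # Fixed-point text for each value, e.g. 0.5 -> "0.50000000"
  val_txt <- formatC(vals, format = "f", digits = decimal_places)
  feats <- paste0(cols, ":", val_txt)
  paste(c(as.character(label), feats), collapse = " ")
}

# One row with label 1 and non-zero entries in columns 1 and 3
line <- format_row(1L, cols = c(1L, 3L), vals = c(0.5, 2))
print(line)
```

Only the non-zero entries of a row appear in the line, which is what makes the format suitable for sparse data.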

Usage

write.sparse(
  file,
  X,
  y,
  qid = NULL,
  integer_labels = TRUE,
  index1 = TRUE,
  sort_indices = TRUE,
  ignore_zeros = TRUE,
  add_header = FALSE,
  decimal_places = 8L,
  append = FALSE,
  to_string = FALSE
)

Arguments

file

Output file path into which to write the data. Will be ignored when passing 'to_string=TRUE'.

X

Sparse data to write. Can be a sparse matrix from package 'Matrix' (classes: 'dgRMatrix', 'dgTMatrix', 'dgCMatrix', 'ngRMatrix', 'ngTMatrix', 'ngCMatrix') or from package 'SparseM' (classes: 'matrix.csr', 'matrix.coo', 'matrix.csc'), or a dense matrix of all numeric values, passed either as a 'matrix' or as a 'data.frame'.

If 'X' is a vector (classes 'numeric', 'integer', 'dsparseVector'), will be assumed to be a row vector and will thus write one row only.

Note that the data will be converted to 'dgRMatrix' in any case.

y

Labels for the data. Can be passed as a vector ('integer' or 'numeric') if each observation has one label, or as a sparse or dense matrix (same format as 'X') if each observation can have more than 1 label. In the latter case, only the non-missing column indices will be written, while the values are ignored.

qid

Secondary label information used for ranking algorithms. Must be an integer vector if passed. Note that not all software supports this.

integer_labels

Whether to write the labels as integers. If passing 'FALSE', they will have a decimal point regardless of whether they are integers or not. If the file is meant to be used for a classification algorithm, one should pass 'TRUE' here (the default). For multilabel classification, the labels will always be written as integers.

index1

Whether the column and label indices (if multi-label) should have numeration starting at 1. Most software assumes this is 'TRUE'.

sort_indices

Whether to sort the indices of 'X' (and of 'y' if multi-label) before writing the data. Note that this will cause in-place modifications if either 'X' or 'y' are passed as CSR matrices from the 'Matrix' package.

ignore_zeros

Whether to ignore (not write) features with a value of zero after rounding to the specified decimal places.

add_header

Whether to add a header with metadata as the first line (number of rows, number of columns, number of classes). If passing 'integer_labels=FALSE' and 'y' is a vector, will write zero as the number of classes. Such headers are not supported by most software.

decimal_places

Number of decimal places to use for numeric values. All values will have exactly this number of places after the decimal point. Be aware that values are rounded, so values that are too small might turn into zeros (which will be skipped by default). To avoid this, one can do something like 'X@x <- ifelse(X@x >= 0, pmax(X@x, 1e-8), pmin(X@x, -1e-8))' before writing.

append

Whether to append text at the end of the file instead of overwriting or creating a new file. Ignored when passing 'to_string=TRUE'.

to_string

Whether to write the result into a string (which will be returned from the function) instead of into a file.
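
The rounding behaviour described under 'decimal_places' can be sketched in base R as follows. The plain numeric vector 'x' is an assumption standing in for the 'X@x' slot of a sparse matrix, and 'formatC' is used here only to imitate fixed-point text output; the package's own formatting code may differ.

```r
decimal_places <- 8L
x <- c(3e-10, -4e-9, 0.25, -1.5)   # assumed stand-in for the 'X@x' slot

# With 8 decimal places, the first two values are written as zeros.
before <- formatC(x, format = "f", digits = decimal_places)
print(before)

# Move small magnitudes away from zero, up to the smallest value
# that is still representable with this many decimal places.
eps <- 10^(-decimal_places)
x_clipped <- ifelse(x >= 0, pmax(x, eps), pmin(x, -eps))

after <- formatC(x_clipped, format = "f", digits = decimal_places)
print(after)
```

Note that this kind of clipping changes the data: it trades a small distortion of tiny values for keeping them present in the output.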

Details

Be aware that writing sparse matrices to text is not a lossless operation - that is, some information might be lost due to numeric precision, and metadata such as row and column names will not be saved. It is recommended to use 'saveRDS' or similar for saving data between R sessions, or to use binary formats for passing between different software such as R->Python.

The option 'ignore_zeros' is implemented heuristically, by keeping only values for which 'abs(x) >= 10^(-decimal_places)/2'. This might not match exactly the rounding that is done implicitly in string conversions by the libc/libc++ functions, so there might still be some corner cases in which features are written with all-zero values when their absolute values are very close to the rounding threshold.
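
The threshold comparison described above can be tried out on a plain numeric vector. This is only a sketch of the stated comparison, with made-up values and a deliberately small 'decimal_places' so the effect is easy to see; it does not reproduce the package's internal code.

```r
decimal_places <- 2L
x <- c(0.004, 0.006, -0.0051, 0.5)   # made-up example values

# Comparison from the text: a value counts as non-zero when its
# magnitude is at least half a unit in the last decimal place.
keep <- abs(x) >= 10^(-decimal_places) / 2

# Fixed-point text with the same number of decimal places, for comparison
as_text <- formatC(x, format = "f", digits = decimal_places)

print(data.frame(x = x, keep = keep, as_text = as_text))
```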

While R uses C 'double' type for numeric values, most of the software that is able to take input data in this format uses 'float' type, which has less precision.
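
To get a feeling for how much precision is lost when values end up as 'float', one possible approach in base R is a round trip through 4-byte binary storage. The helper 'to_float' is hypothetical and is not provided by 'readsparse'.

```r
# Hypothetical helper: round-trips doubles through 4-byte (single
# precision) storage to emulate what 'float'-based software would hold.
to_float <- function(x) {
  readBin(writeBin(x, raw(), size = 4L), what = "numeric",
          n = length(x), size = 4L)
}

x <- c(0.1, 123456789.123, 1 / 3)
x_float <- to_float(x)

# Side-by-side view of the stored values
print(data.frame(as_double = sprintf("%.10f", x),
                 as_float  = sprintf("%.10f", x_float)))
```

Writing more decimal places than a 'float' can hold does no harm, but it will not make the downstream values any more precise.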

The function uses different code paths when writing to a file and when writing to a string, and there might be slight differences between the texts they generate. If any such difference is encountered, please submit a bug report on the package's GitHub page.

Value

If passing 'to_string=FALSE' (the default), will not return anything ('invisible(NULL)'). If passing 'to_string=TRUE', will return a 'character' variable with the data contents written into it.
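
A short usage sketch covering both return modes is given below. It assumes that the 'Matrix' and 'readsparse' packages are installed (the code is skipped otherwise), that 'file=NULL' is accepted when 'to_string=TRUE', and uses a small made-up matrix; the exact text produced is not shown here since it depends on the formatting options.

```r
if (requireNamespace("Matrix", quietly = TRUE) &&
    requireNamespace("readsparse", quietly = TRUE)) {
  # Made-up 2x3 sparse matrix ('dgCMatrix') and one label per row
  X <- Matrix::sparseMatrix(i = c(1L, 1L, 2L), j = c(1L, 3L, 2L),
                            x = c(0.5, 2, 1.25), dims = c(2L, 3L))
  y <- c(1L, 0L)

  # Returned as a 'character' value; 'file' is ignored in this mode
  txt <- readsparse::write.sparse(file = NULL, X = X, y = y,
                                  to_string = TRUE)
  cat(txt)

  # Written to a temporary file; nothing is returned
  out_path <- tempfile(fileext = ".txt")
  readsparse::write.sparse(out_path, X = X, y = y)
  print(readLines(out_path))
}
```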

See Also

read.sparse


[Package readsparse version 0.1.5-6 Index]