write.sparse {readsparse} | R Documentation |
Write Sparse Matrix in Text Format
Description
Write a labelled sparse matrix into text format as used by software such as SVMLight, LibSVM, ThunderSVM, LibFM, xLearn, XGBoost, LightGBM, and others - i.e.:
<label(s)> <column:value> <column:value> ...
For more information about the format and usage examples, see read.sparse.
Can write labels for regression, classification (binary, multi-class, and multi-label), and ranking (with 'qid'), but note that most software that makes use of this data format supports only regression and binary classification.
Usage
write.sparse(
file,
X,
y,
qid = NULL,
integer_labels = TRUE,
index1 = TRUE,
sort_indices = TRUE,
ignore_zeros = TRUE,
add_header = FALSE,
decimal_places = 8L,
append = FALSE,
to_string = FALSE
)
Arguments
file |
Output file path into which to write the data. Will be ignored when passing 'to_string=TRUE'. |
X |
Sparse data to write. Can be a sparse matrix from package 'Matrix' (classes: 'dgRMatrix', 'dgTMatrix', 'dgCMatrix', 'ngRMatrix', 'ngTMatrix', 'ngCMatrix') or from package 'SparseM' (classes: 'matrix.csr', 'matrix.coo', 'matrix.csc'), or a dense matrix of all numeric values, passed either as a 'matrix' or as a 'data.frame'. If 'X' is a vector (classes 'numeric', 'integer', 'dsparseVector'), it will be assumed to be a row vector, and thus only one row will be written. Note that the data will be converted to 'dgRMatrix' in any case. |
y |
Labels for the data. Can be passed as a vector ('integer' or 'numeric') if each observation has one label, or as a sparse or dense matrix (same format as 'X') if each observation can have more than 1 label. In the latter case, only the non-missing column indices will be written, while the values are ignored. |
qid |
Secondary label information used for ranking algorithms. Must be an integer vector if passed. Note that not all software supports this. |
integer_labels |
Whether to write the labels as integers. If passing 'FALSE', they will have a decimal point regardless of whether they are integers or not. If the file is meant to be used for a classification algorithm, one should pass 'TRUE' here (the default). For multilabel classification, the labels will always be written as integers. |
index1 |
Whether the column and label indices (if multi-label) should be numbered starting at 1. Most software assumes this is 'TRUE'. |
sort_indices |
Whether to sort the indices of 'X' (and of 'y' if multi-label) before writing the data. Note that this will cause in-place modifications if either 'X' or 'y' are passed as CSR matrices from the 'Matrix' package. |
ignore_zeros |
Whether to ignore (not write) features with a value of zero after rounding to the specified decimal places. |
add_header |
Whether to add a header with metadata as the first line (number of rows, number of columns, number of classes). If passing 'integer_labels=FALSE' and 'y' is a vector, will write zero as the number of classes. Note that such headers are not supported by most software. |
decimal_places |
Number of decimal places to use for numeric values. All values will have exactly this number of places after the decimal point. Be aware that values are rounded and might turn into zeros (which will be skipped by default) if they are too small. To avoid this, one can push small magnitudes up to the smallest representable value, for example with 'X@x <- ifelse(X@x >= 0, pmax(X@x, 1e-8), pmin(X@x, -1e-8))'. |
append |
Whether to append text at the end of the file instead of overwriting or creating a new file. Ignored when passing 'to_string=TRUE'. |
to_string |
Whether to write the result into a string (which will be returned from the function) instead of into a file. |
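The note under 'decimal_places' about small values being rounded to zero can be illustrated with a minimal base-R sketch. The helper name 'clamp_away_from_zero' is hypothetical (it is not part of the package), and the example works on a plain numeric vector standing in for the 'x' slot of a sparse matrix.

```r
# Hypothetical helper (not part of 'readsparse'): pushes magnitudes smaller
# than 10^(-decimal_places) up to that floor, keeping the original sign.
# Caveat: explicitly stored zeros are treated as non-negative and are
# therefore also raised to the positive floor.
clamp_away_from_zero <- function(x, decimal_places = 8L) {
  floor_value <- 10^(-decimal_places)
  ifelse(x >= 0, pmax(x, floor_value), pmin(x, -floor_value))
}

vals <- c(3e-10, -2e-12, 0.5, -1.25)   # stand-in for X@x
clamped <- clamp_away_from_zero(vals)

# Compare how the values print with 8 decimal places before and after
print(sprintf("%.8f", vals))
print(sprintf("%.8f", clamped))
```

For a 'Matrix' object, the same helper could presumably be applied as 'X@x <- clamp_away_from_zero(X@x)' before writing.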
Details
Be aware that writing sparse matrices to text is not a lossless operation - that is, some information might be lost due to numeric precision, and metadata such as row and column names will not be saved. It is recommended to use 'saveRDS' or similar for saving data between R sessions, or to use binary formats for passing between different software such as R->Python.
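As a minimal sketch of the 'saveRDS' recommendation, the following round trip uses a small dense base-R matrix as a stand-in for a sparse one (an assumption made only to keep the example free of extra packages). It shows that values and row/column names survive the round trip.

```r
# Small matrix with dimnames and a value that has no short decimal form
m <- matrix(c(0, 1/3, 2, 0), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b")))

path <- tempfile(fileext = ".rds")
saveRDS(m, path)          # binary serialization, no rounding involved
m2 <- readRDS(path)
unlink(path)              # clean up the temporary file

print(identical(m, m2))   # checks both values and dimnames
```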
The option 'ignore_zeros' is implemented heuristically, by checking 'abs(x) >= 10^(-decimal_places)/2', which might not match exactly the rounding done implicitly in string conversions by the libc/libc++ functions. Thus, there might still be corner cases in which features are written as all zeros when their (absolute) values are very close to the rounding threshold.
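The comparison described above can be sketched in base R as follows. This only mirrors the stated formula and uses 'sprintf' as a stand-in for the string conversion; it is not the package's actual implementation, and the variable names are illustrative.

```r
decimal_places <- 2L
x <- c(0.004, 0.006, -0.02, 0)

# Heuristic from the text: keep a value if its magnitude reaches
# half of the smallest unit representable with 'decimal_places' digits
threshold <- 10^(-decimal_places) / 2
kept <- abs(x) >= threshold

# What a fixed-precision string conversion produces for the same values
printed <- sprintf(paste0("%.", decimal_places, "f"), x)

print(data.frame(x = x, kept = kept, printed = printed))
```

Values lying very close to 'threshold' are the corner cases mentioned in the text, since the heuristic and the string conversion may disagree there.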
While R uses C 'double' type for numeric values, most of the software that is able to take input data in this format uses 'float' type, which has less precision.
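To get a feeling for the precision lost when a consumer parses values as 'float', one can round-trip numbers through a 4-byte representation in base R. The helper name 'to_float' is hypothetical and exists only for this illustration.

```r
# Hypothetical helper: converts doubles to single precision and back
# by writing them as 4-byte floats into a raw vector
to_float <- function(x) {
  readBin(writeBin(x, raw(), size = 4L),
          what = "numeric", n = length(x), size = 4L)
}

x <- c(0.1, 16777217)     # 16777217 = 2^24 + 1
x32 <- to_float(x)

print(format(x, digits = 17))
print(format(x32, digits = 17))
print(x32 == x)           # shows which values survived unchanged
```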
The function uses different code paths when writing to a file and when writing to a string, and there might be slight differences between the texts they generate. If any such difference is encountered, please submit a bug report on the package's GitHub page.
Value
If passing 'to_string=FALSE' (the default), returns 'NULL' invisibly ('invisible(NULL)'). If passing 'to_string=TRUE', returns a 'character' variable containing the written data.
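A minimal usage sketch based on the signature shown under Usage follows. It assumes that packages 'Matrix' and 'readsparse' are installed (the code is skipped otherwise), and the data is randomly generated purely for illustration.

```r
txt <- NULL
if (requireNamespace("Matrix", quietly = TRUE) &&
    requireNamespace("readsparse", quietly = TRUE)) {
  set.seed(1)
  X <- Matrix::rsparsematrix(nrow = 5, ncol = 10, nnz = 8)
  y <- c(1L, 0L, 1L, 1L, 0L)            # binary classification labels

  out_file <- file.path(tempdir(), "example_sparse.txt")

  # Write to a file: nothing useful is returned
  readsparse::write.sparse(out_file, X, y, decimal_places = 4L)
  cat(readLines(out_file), sep = "\n")

  # Write to a string instead: 'file' is ignored in this case
  txt <- readsparse::write.sparse(out_file, X, y,
                                  decimal_places = 4L, to_string = TRUE)
  cat(txt)
  unlink(out_file)
}
```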