read.dsm.ucs {wordspace} | R Documentation
Load Raw DSM Data from Disk Files in UCS Export Format (wordspace)
Description
This function loads raw DSM data – a cooccurrence frequency matrix and tables of marginal frequencies – in UCS export format. The data are read from a directory containing several text files with predefined names, which can optionally be compressed (see ‘File Format’ below for details).
Usage
read.dsm.ucs(filename, encoding = getOption("encoding"), verbose = FALSE)
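A call might look like the following minimal sketch. The helper `load_ucs_if_available` and the directory name `my_ucs_export` are hypothetical and used only for illustration; the sketch assumes that the wordspace package is installed and returns NULL otherwise, or when the directory does not exist.

```r
# Hypothetical helper (not part of wordspace): load a UCS export directory
# if the package and the directory are both available, else return NULL.
load_ucs_if_available <- function(dir, encoding = getOption("encoding"),
                                  verbose = FALSE) {
  if (!requireNamespace("wordspace", quietly = TRUE)) return(NULL)
  if (!dir.exists(dir)) return(NULL)
  wordspace::read.dsm.ucs(dir, encoding = encoding, verbose = verbose)
}

# "my_ucs_export" is a placeholder directory name (assumption)
model <- load_ucs_if_available("my_ucs_export")
if (!is.null(model)) print(class(model))
```

Guarding the call this way keeps a script from failing when the export directory has not been created yet.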
Arguments
filename
the name of a directory containing files with the raw DSM data.
encoding
character encoding of the input files, which will automatically be converted to R's internal representation if possible. See ‘Encoding’ in
verbose
if
Value
An object of class dsm containing a dense or sparse DSM.
Note that the information tables for target terms (field rows) and feature terms (field cols) include the correct marginal frequencies from the UCS export files. Nonzero counts for rows and columns are added automatically unless they are already present in the disk files. Additional fields from the information tables as well as all global variables are preserved with their original names.
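To illustrate what a nonzero count is, the following base R sketch computes such counts for a toy matrix. The matrix values and the variable names `row_nonzero` and `col_nonzero` are made up for this example; the sketch does not show how read.dsm.ucs computes or stores these counts internally.

```r
# Toy co-occurrence matrix (illustrative values, not from a real data set)
M <- matrix(c(3, 0, 1,
              0, 0, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), c("eat", "run", "sleep")))

# Number of nonzero cells in each row and in each column
row_nonzero <- rowSums(M != 0)
col_nonzero <- colSums(M != 0)

print(row_nonzero)
print(col_nonzero)
```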
File Format
The UCS export format is a directory containing the following files with the specified names:
- ‘M’ or ‘M.mtx’: cooccurrence matrix (dense, plain text) or sparse matrix (MatrixMarket format)
- ‘rows.tbl’: row information (labels term, marginal frequencies f)
- ‘cols.tbl’: column information (labels term, marginal frequencies f)
- ‘globals.tbl’: table with single row containing global variables; must include variable N specifying sample size
Each individual file may be compressed with an additional filename extension .gz, .bz2 or .xz; read.dsm.ucs automatically decompresses such files when loading them.
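The file naming scheme can be sketched in base R as follows. The helper `find_ucs_file` is hypothetical and is not how read.dsm.ucs is implemented; the demo table layout (tab-delimited with a header line) and the value of N are assumptions made for illustration. The sketch relies on the fact that base R's gzfile(), when opened for reading, also handles bzip2-compressed, xz-compressed and uncompressed files.

```r
# Hypothetical helper: find a file by base name, allowing an optional
# compression extension; returns NA if no variant exists.
find_ucs_file <- function(dir, basenames,
                          extensions = c("", ".gz", ".bz2", ".xz")) {
  candidates <- file.path(dir, as.vector(outer(basenames, extensions, paste0)))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

# Demo: write a gzip-compressed one-row table into a temporary directory
# (assumed layout: header line followed by a single row; N = 1000 is made up)
demo_dir <- file.path(tempdir(), "ucs_demo")
dir.create(demo_dir, showWarnings = FALSE)
con <- gzfile(file.path(demo_dir, "globals.tbl.gz"), open = "wt")
writeLines(c("N", "1000"), con)
close(con)

globals_path <- find_ucs_file(demo_dir, "globals.tbl")
matrix_path <- find_ucs_file(demo_dir, c("M", "M.mtx"))  # no matrix file written

# Read the table back; gzfile() detects the compression type on reading
con <- gzfile(globals_path, open = "rt")
globals <- read.delim(con)
close(con)
print(globals$N)
```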
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
References
The UCS toolkit is a software package for collecting and manipulating co-occurrence data available from http://www.collocations.de/software.html.
UCS relies on compressed text files as its main storage format. They can be exported as a DSM with ucs-tool export-dsm-matrix.