dsm {wordspace} | R Documentation |
Create DSM Object Representing a Distributional Semantic Model (wordspace)
Description
This is the constructor function for dsm objects representing distributional semantic models, i.e. a co-occurrence matrix together with additional information on target terms (rows) and features (columns).
A new DSM can be initialised with a dense or sparse co-occurrence matrix, or with a triplet representation of a sparse matrix.
Usage
dsm(M = NULL, target = NULL, feature = NULL, score = NULL,
rowinfo = NULL, colinfo = NULL, N = NA,
globals = list(), raw.freq = FALSE, sort = FALSE, verbose = FALSE)
Arguments
M |
a dense or sparse co-occurrence matrix. A sparse matrix must be a subclass of dMatrix (from the Matrix package) |
target |
a character vector of target terms (see "Details" below) |
feature |
a character vector of feature terms (see "Details" below) |
score |
a numeric vector of co-occurrence frequencies or weighted/transformed scores (see "Details" below) |
rowinfo |
a data frame containing information about the rows of the co-occurrence matrix, corresponding to target terms. The data frame must include a column term (see "Details" below) |
colinfo |
a data frame containing information about the columns of the co-occurrence matrix, corresponding to feature terms. The data frame must include a column term (see "Details" below) |
N |
a single numeric value specifying the effective sample size of the co-occurrence matrix. This value may be determined automatically if |
globals |
a list of global variables, which are included in the |
raw.freq |
if |
sort |
if |
verbose |
if |
Details
The co-occurrence matrix forming the core of the distributional semantic model (DSM) can be specified in two different ways:
- As a dense or sparse matrix in argument M. A sparse matrix must be a subclass of dMatrix (from the Matrix package) and is automatically converted to the canonical storage mode used by the wordspace package. Row and column labels may be specified with arguments target and feature, which must be character vectors of suitable length; otherwise dimnames(M) are used.
- As a triplet representation in arguments target (row label), feature (column label) and score (co-occurrence frequency or pre-computed score). The three arguments must be vectors of the same length; each set of corresponding elements specifies a non-zero cell of the co-occurrence matrix. If multiple entries for the same cell are given, their frequency or score values are added up.
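The two ways of specifying the matrix can be sketched side by side. This is a minimal illustration with made-up counts; the variable names counts, FromMatrix and FromTriplets are only chosen for this example. The second call lists the cell (boat, buy) twice to show that duplicate entries are added up.

```r
library(wordspace)

# Way 1: a dense matrix whose dimnames supply the row and column labels
counts <- matrix(
  c(1, 3, 0,
    0, 0, 2,
    1, 0, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("boat", "cat", "dog"), c("buy", "use", "feed"))
)
FromMatrix <- dsm(counts, raw.freq = TRUE)

# Way 2: triplets; the cell (boat, buy) is listed twice on purpose
FromTriplets <- dsm(
  target   = c("boat", "boat", "boat", "cat"),
  feature  = c("buy", "buy", "use", "feed"),
  score    = c(1, 2, 3, 2),
  raw.freq = TRUE
)

print(dim(FromMatrix$M))              # dimensions of the dense matrix
print(FromTriplets$M["boat", "buy"])  # the two entries for this cell are summed
```

Cells are looked up by label rather than by position here, so the example does not depend on the order in which rows and columns are stored.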
The optional arguments rowinfo and colinfo are data frames with additional information about target and feature terms. If they are specified, they must contain a column $term matching the row or column labels of the co-occurrence matrix. Marginal frequencies and nonzero or document counts can be given in columns $f and $nnzero; any further columns are interpreted as meta-information on the target or feature terms. The rows of each data frame are automatically reordered to match the rows or columns of the co-occurrence matrix. Target or feature terms that do not appear in the co-occurrence matrix are silently discarded.
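As a small sketch of how row information is aligned with the matrix, the example below passes a hypothetical data frame (info, with an invented meta-information column animate) whose rows are out of order and include one term, "fish", that never occurs in the co-occurrence data.

```r
library(wordspace)

# Hypothetical row information: out of order, with one term ("fish")
# that does not occur in the co-occurrence data
info <- data.frame(
  term    = c("dog", "fish", "boat", "cat"),
  animate = c(TRUE, TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

WithInfo <- dsm(
  target   = c("boat", "boat", "cat", "dog", "dog"),
  feature  = c("buy", "use", "feed", "buy", "feed"),
  score    = c(1, 3, 2, 1, 1),
  rowinfo  = info,
  raw.freq = TRUE
)

print(WithInfo$rows)            # row information after alignment
print(rownames(WithInfo$M))     # row labels of the co-occurrence matrix
```

Comparing the two printouts should show the term column in the same order as the matrix rows, with no entry for "fish".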
Counts of nonzero cells for each row and column are computed automatically, unless they are already present in the rowinfo and colinfo data frames. If the co-occurrence matrix contains raw frequency values, marginal frequencies for the target and feature terms are also computed automatically unless given in rowinfo and colinfo; the same holds for the effective sample size N.
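The automatically computed columns can be checked against a manual calculation. The sketch below assumes that the marginal frequency of a term is the sum of its row and that the nonzero count is the number of nonzero cells in that row; the names Auto and dense are only used for this illustration.

```r
library(wordspace)
library(Matrix)  # needed to convert the sparse matrix below

Auto <- dsm(
  target   = c("boat", "boat", "cat", "dog", "dog"),
  feature  = c("buy", "use", "feed", "buy", "feed"),
  score    = c(1, 3, 2, 1, 1),
  raw.freq = TRUE
)

# Recompute the same quantities by hand from a dense copy of the matrix
dense <- as.matrix(Auto$M)
check <- data.frame(
  term         = rownames(dense),
  f.manual     = rowSums(dense),       # row sums as marginal frequencies
  nnzero.manual = rowSums(dense != 0)  # number of nonzero cells per row
)

print(Auto$rows)  # columns filled in by dsm()
print(check)      # values computed by hand for comparison
```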
If raw.freq=TRUE, all matrix entries must be non-negative; fractional frequency counts are allowed, however.
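The constraint on raw frequencies can be probed with a short sketch. Fractional counts should be accepted; a negative count is expected to be rejected on the basis of the requirement above, but the exact behaviour (error or otherwise) is an assumption, so the example only reports what happens.

```r
library(wordspace)

# Fractional frequency counts are allowed with raw.freq = TRUE
Fractional <- dsm(
  target   = c("boat", "cat"),
  feature  = c("buy", "feed"),
  score    = c(0.5, 1.25),
  raw.freq = TRUE
)
print(Fractional$M["boat", "buy"])

# A negative raw frequency violates the requirement; report the outcome
# instead of assuming a particular error
outcome <- tryCatch({
  dsm(
    target   = c("boat", "cat"),
    feature  = c("buy", "feed"),
    score    = c(-1, 2),
    raw.freq = TRUE
  )
  "accepted"
}, error = function(e) "rejected")
print(outcome)
```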
Value
An object of class dsm, a list with the following components:
M |
A co-occurrence matrix of raw frequency counts in canonical format (see |
S |
A weighted and transformed co-occurrence matrix ("score" matrix) in canonical format (see |
rows |
A data frame with information about the target terms, corresponding to the rows of the co-occurrence matrix. The data frame usually has at least three columns: term, f and nnzero (see "Details" above). Further columns may provide additional information. |
cols |
A data frame with information about the feature terms, corresponding to the columns of the co-occurrence matrix, in the same format as |
globals |
A list of global variables. The following variables have a special meaning:
|
Author(s)
Stephanie Evert (https://purl.org/stephanie.evert)
See Also
See dsm.canonical.matrix for a description of the canonical matrix formats. DSM objects are usually loaded directly from a disk file in UCS (read.dsm.ucs) or triplet (read.dsm.triplet) format.
Examples
MyDSM <- dsm(
target = c("boat", "boat", "cat", "dog", "dog"),
feature = c("buy", "use", "feed", "buy", "feed"),
score = c(1, 3, 2, 1, 1),
raw.freq = TRUE
)
print(MyDSM) # 3 x 3 matrix with 5 out of 9 nonzero cells
print(MyDSM$M) # the actual co-occurrence matrix
print(MyDSM$rows) # row information
print(MyDSM$cols) # column information