pwMatrix {qlcMatrix} | R Documentation |
Construct ‘part-whole’ (pw) Matrices from tokenized strings
Description
A part-whole Matrix is a sparse matrix representation of a vector of strings (‘wholes’) split into smaller parts by a specified separator. It basically summarizes which strings consist of which parts. By itself, this is not a very interesting transformation, but it allows for quite fancy computations by simple matrix manipulations.
Usage
pwMatrix(strings, sep = "", gap.length = 0, gap.symbol = "\u2043", simplify = FALSE)
Arguments
strings |
a vector (or list) of strings to be separated into parts |
sep |
The separator to be used. Defaults to space |
gap.length |
This adds the specified number of gap symbols between each pair of strings. This is only important for generating higher ngram-statistics later on, when no ordering of the strings is implied. For example, when the strings are alphabetically ordered words, any bigram-statistics should not count the bigrams consisting of the last character of the a word with the first character of the next word. |
gap.symbol |
The gap symbol to insert (see gap.length above). It defaults to U+2043 |
simplify |
by default, the row and column names are not included into the matrix to keep the matrix as lean as possible. The row names (‘parts’) are returned separately. Using |
Details
Internally, this is basically using strsplit
and some cosmetic changes, returning a sparse matrix.
Value
By default (when simplify = F
) the output is a list with two elements, containing:
M |
a sparse pattern Matrix of type |
rownames |
all different characters from the strings in order (i.e. all individual tokens of the original strings). |
When simplify = T
, then only the matrix M with row and column names is returned.
Author(s)
Michael Cysouw
See Also
Used in splitStrings
and splitWordlist
Examples
# By itself, this functions does nothing really interesting
example <- c("this","is","an","example")
pw <- pwMatrix(example)
pw
# However, making a type-token Matrix (with ttMatrix) of the rownames
# and then taking a matrix product, results in frequencies of each element in the strings
tt <- ttMatrix(pw$rownames)
distr <- (tt$M*1) %*% (pw$M*1)
rownames(distr) <- tt$rownames
colnames(distr) <- example
distr
# Use banded sparse matrix with superdiagonal ('shift matrix') to get co-occurrence counts
# of adjacent characters. Rows list first character, columns adjacent character.
# Non-zero entries list number of co-occurrences
S <- bandSparse( n = ncol(tt$M), k = 1) * 1
TT <- tt$M * 1
( C <- TT %*% S %*% t(TT) )
# show the non-zero entries as triplets:
s <- summary(C)
first <- tt$rownames[s[,1]]
second <- tt$rownames[s[,2]]
freq <- s[,3]
data.frame(first,second,freq)