cosSparse {qlcMatrix} | R Documentation |
Cosine similarity between columns (sparse matrices)
Description
cosSparse
computes the cosine similarity between the columns of sparse matrices. Different normalizations and weightings can be specified. Performance-wise, this strongly improves over the approach taken in the corSparse
function, though the results are almost identical for large sparse matrices. cosMissing
adds the possibility to deal with large amounts of missing data.
Usage
cosSparse(X, Y = NULL, norm = norm2, weight = NULL)
cosMissing(X, availX, Y = NULL, availY = NULL, norm = norm2 , weight = NULL)
Arguments
X |
a sparse matrix in a format of the |
Y |
a second matrix in a format of the |
availX , availY |
sparse Matrices (typically pattern matrices) of the same size of X and Y, respectively, indicating the available information for each matrix. |
norm |
The function to be used for the normalization of the columns. Currently |
weight |
The function to be used for the weighting of the rows. Currently |
Details
This measure is called a ‘cosine’ similarity as it computes the cosine of the angle between high-dimensional vectors. It can also be considered a Pearson correlation without centering. Because centering removes sparsity, and because centering has almost no influence for highly sparse matrices, this cosine similarity performs much better that the Pearson correlation, both related for speed and memory consumption.
The variant cosMissing
can be used when the available information itself is also sparse. In such a situation, a zero in the data matrix X, Y can mean either ‘zero value’ or ‘missing data’. To deal with the missing data, matrices indicating the available data can be specified. Note that this really only makes sense when the available data is sparse itself. When, say, 90% of the data is available, the availX
matrix becomes very large, and the results does not differ strongly from the regular cosSparse
, i.e. ignoring the missing data.
Different normalizations of the columns and weightings of the rows can be specified.
The predefined normalizations are defined as a function of the matrix x and a ‘summation function’ s (to be specified as a sparse matrix or a vector). This slight complexity is needed to be able to deal with missing data. With complete data, then s = rep(1,nrow(X))
, leads to crossprod(X,s) == colSums(X)
.
norm2
: euclidean norm. The default setting, and the same normalization as used in the Pearson correlation. It is defined as
norm2 <- function(x,s) { drop(crossprod(x^2,s)) ^ 0.5 }
.norm1
: Manhattan, or taxi-cab norm, defined as
norm1 <- function(x,s) { abs(drop(crossprod(x,s))) }
.normL
: normalized Laplacian norm, used in spectral clustering of a graph, defined as
normL <- function(x,s) { abs(drop(crossprod(x,s))) ^ 0.5 }
.
The predefined weightings are defined as a function of the frequency of a row (s) and the number of columns (N):
idf
: inverse document frequency, used typically in distributional semantics to down-weight high frequent rows. It is defined as
idf <- function(s,N) { log(N/(1+s)) }
.isqrt
: inverse square root, an alternative to idf, defined as
isqrt <- function(s,N) { s^-0.5 }
.none
: no weighting. This is only included for use inside later high-level functions (e.g.sim.words
). Normally,weight = NULL
gives identical results, but is slightly quicker.
none <- function(s,N) { s }
Further norms of weighting functions can be defined at will.
Value
The result is a sparse matrix with the non-zero association values. Values range between -1 and +1, with values close to zero indicating low association.
When Y = NULL
, then the result is a symmetric matrix, so a matrix of type dsCMatrix
with size ncol(X)
by ncol{X}
is returned. When X
and Y
are both specified, a matrix of type dgCMatrix
with size ncol(X)
by ncol{Y}
is returned.
Note
For large sparse matrices, consider this as an alternative to cor
. See corSparse
for a comparison of performance and results.
Author(s)
Michael Cysouw
See Also
corSparse
, assocSparse
for other sparse association measures. See also cosRow, cosCol
for variants of cosSparse dealing with nominal data.
Examples
# reasonable fast on modern hardware
# try different sizes to find limits on local machine
system.time(X <- rSparseMatrix(1e8, 1e8, 1e6))
system.time(M <- cosSparse(X))
# consider removing small values of result to improve sparsity
X <- rSparseMatrix(1e5, 1e5, 1e6)
print(object.size(X), units = "auto") # 12 Mb
system.time(M <- cosSparse(X))
print(object.size(M), units = "auto") # 59 Mb
M <- drop0(M, tol = 0.1) # remove small values
print(object.size(M), units = "auto") # 14 Mb
# Compare various weightings
# with random data from a normal distribution there is almost no difference
#
# data from a normal distribution
X <- rSparseMatrix(1e2, 1e2, 1e3)
w0 <- cosSparse(X, norm = norm2, weight = NULL)@x
wi <- cosSparse(X, norm = norm2, weight = idf)@x
ws <- cosSparse(X, norm = norm2, weight = isqrt)@x
pairs(~ w0 + wi + ws,
labels=c("no weighting","inverse\ndocument\nfrequency","inverse\nsquare root"))
# with heavily skewed data there is a strong difference!
X <- rSparseMatrix(1e2, 1e2, 1e3,
rand.x = function(n){round(rpois(1e3, 10), 2)})
w0 <- cosSparse(X, norm = norm2, weight = NULL)@x
wi <- cosSparse(X, norm = norm2, weight = idf)@x
ws <- cosSparse(X, norm = norm2, weight = isqrt)@x
pairs(~ w0 + wi + ws,
labels=c("no weighting","inverse\ndocument\nfrequency","inverse\nsquare root"))