simil {proxyC} | R Documentation |
Compute similarity/distance between rows or columns of large matrices
Description
Fast similarity/distance computation function for large sparse matrices. You
can floor small similarity values to zero to save computation time and
storage space, using an arbitrary threshold (min_simil) or rank (rank). You
can specify the number of threads for parallel computing via
options(proxyC.threads).
Usage
simil(
  x,
  y = NULL,
  margin = 1,
  method = c("cosine", "correlation", "jaccard", "ejaccard", "fjaccard", "dice", "edice",
    "hamann", "faith", "simple matching"),
  min_simil = NULL,
  rank = NULL,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)

dist(
  x,
  y = NULL,
  margin = 1,
  method = c("euclidean", "chisquared", "kullback", "jeffreys", "jensen", "manhattan",
    "maximum", "canberra", "minkowski", "hamming"),
  p = 2,
  smooth = 0,
  drop0 = FALSE,
  diag = FALSE,
  use_nan = NULL,
  digits = 14
)
Arguments
x |
matrix or Matrix object. Dense matrices are converted to the CsparseMatrix-class internally. |
y |
if a matrix or Matrix object is provided, proximity
between documents or features in x and y is computed. |
margin |
integer indicating the margin of similarity/distance computation: 1 indicates rows and 2 indicates columns. |
method |
method to compute similarity or distance |
min_simil |
the minimum similarity value to be recorded. |
rank |
an integer value specifying the top-n highest similarity values to be recorded. |
drop0 |
if |
diag |
if |
use_nan |
if |
digits |
determines rounding of small values towards zero. Used primarily to correct rounding errors in C++. See zapsmall. |
p |
weight for Minkowski distance |
smooth |
adds a fixed value to all the cells to avoid division by zero.
Only used when |
Details
Available Methods
Similarity:
- cosine: cosine similarity
- correlation: Pearson's correlation
- jaccard: Jaccard coefficient
- ejaccard: the real value version of jaccard
- fjaccard: Fuzzy Jaccard coefficient
- dice: Dice coefficient
- edice: the real value version of dice
- hamann: Hamann similarity
- faith: Faith similarity
- simple matching: the percentage of common elements
Distance:
- euclidean: Euclidean distance
- chisquared: chi-squared distance
- kullback: Kullback–Leibler divergence
- jeffreys: Jeffreys divergence
- jensen: Jensen–Shannon divergence
- manhattan: Manhattan distance
- maximum: the largest difference between values
- canberra: Canberra distance
- minkowski: Minkowski distance
- hamming: Hamming distance
See the vignette for how the similarity and distance are computed:
vignette("measures", package = "proxyC")
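As a rough cross-check of two of the measures listed above, the sketch below compares the output of simil() and dist() with hand-written formulas on a tiny dense matrix. The matrix m and the helper names (norms, manual_cosine, reference_dist) are made up for illustration, and the formulas are the textbook definitions of cosine similarity and Euclidean distance rather than something taken from the package source; the vignette remains the authoritative description.

```r
# Hypothetical 3 x 3 example matrix; no row is all zeros, so the
# cosine similarity is defined for every pair of rows.
m <- matrix(c(1, 0, 2,
              0, 3, 1,
              2, 2, 0), nrow = 3, byrow = TRUE)

# Cosine similarity between rows, as computed by proxyC
s <- as.matrix(proxyC::simil(m, margin = 1, method = "cosine"))

# Textbook definition: dot product divided by the product of the norms
norms <- sqrt(rowSums(m^2))
manual_cosine <- (m %*% t(m)) / (norms %o% norms)
cos_ok <- isTRUE(all.equal(unname(s), manual_cosine))

# Euclidean distance between rows, compared with stats::dist()
d <- as.matrix(proxyC::dist(m, margin = 1, method = "euclidean"))
reference_dist <- as.matrix(stats::dist(m, method = "euclidean"))
dist_ok <- isTRUE(all.equal(unname(d), unname(reference_dist)))

print(c(cosine = cos_ok, euclidean = dist_ok))
```

The comparison uses all.equal() rather than identical() because the results are rounded according to the digits argument.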
Parallel Computing
The functions perform parallel computing using Intel oneAPI Threading
Building Blocks. The number of threads for parallel computing should be
specified via options(proxyC.threads) before calling the functions. If the
value is -1, all the available threads will be used. Unless the option is
set, the number of threads is limited by the environment variables
(OMP_THREAD_LIMIT or RCPP_PARALLEL_NUM_THREADS) to comply with CRAN
policy and offer backward compatibility.
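A minimal sketch of setting the thread option before a call is shown below. The value 2 is an arbitrary choice for illustration, and the previous option value is saved and restored so that the setting does not leak into the rest of the session.

```r
# Use two threads for the call below (2 is an arbitrary example value;
# -1 would request all available threads). options() returns the old value.
old <- options(proxyC.threads = 2)

mt <- Matrix::rsparsematrix(100, 100, 0.01)
s <- proxyC::simil(mt, method = "cosine")

# Restore the previous setting
options(old)
```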
See Also
zapsmall
Examples
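library(proxyC)
# random 100 x 100 sparse matrix with about 1% non-zero cells
mt <- Matrix::rsparsematrix(100, 100, 0.01)
# cosine similarity between rows
simil(mt, method = "cosine")[1:5, 1:5]
# Euclidean distance between rows
dist(mt, method = "euclidean")[1:5, 1:5]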
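The min_simil and margin arguments can be illustrated in the same way. The sketch below is not part of the package examples: the small matrix m is made up, and the threshold 0.5 is arbitrary. It shows that similarities below min_simil are not recorded, and that computing over columns (margin = 2) should give the same values as computing over the rows of the transposed matrix.

```r
# Hypothetical 3 x 3 example matrix without all-zero rows or columns
m <- matrix(c(1, 0, 2,
              0, 3, 1,
              2, 2, 0), nrow = 3, byrow = TRUE)

# All pairwise cosine similarities between rows
full <- as.matrix(proxyC::simil(m, method = "cosine"))

# Record only similarities of at least 0.5; smaller values read as zero
kept <- as.matrix(proxyC::simil(m, method = "cosine", min_simil = 0.5))

# Column-wise similarity versus row-wise similarity of the transpose
by_col <- as.matrix(proxyC::simil(m, margin = 2, method = "cosine"))
by_row_t <- as.matrix(proxyC::simil(t(m), margin = 1, method = "cosine"))
same_margin <- isTRUE(all.equal(unname(by_col), unname(by_row_t)))

print(round(full, 3))
print(round(kept, 3))
print(same_margin)
```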