tcrossprod_sparse {RNewsflow} | R Documentation |
tcrossprod with benefits, for people that like parameters
Description
This function (including the underlying cpp function batched_tcrossprod_cpp) is the workhorse of the RNewsflow package. It has unnervingly many arguments for a tcrossprod because it needs to be able to do many things efficiently. While it's mostly a backend function, we expose it because it has applications outside of RNewsflow, but we make no excuses for the fact that readability is very much sacrificed here for the convenience of being able to keep adding features that we need for RNewsflow.
Usage
tcrossprod_sparse(
m,
m2 = NULL,
min_value = NULL,
max_value = NULL,
only_upper = F,
diag = T,
top_n = NULL,
rowsum_div = F,
max_p = 1,
pvalue = c("disparity", "normal", "lognormal", "nz_normal", "nz_lognormal"),
normalize = c("none", "l2", "softl2"),
crossfun = c("prod", "min", "softprod", "maxproduct", "lookup", "cp_lookup",
"cp_lookup_norm"),
group = NULL,
group2 = NULL,
date = NULL,
date2 = NULL,
lwindow = -1,
rwindow = 1,
date_unit = c("days", "hours", "minutes", "seconds"),
simmat = NULL,
simmat_thres = NULL,
row_attr = F,
col_attr = F,
lag_attr = F,
batchsize = 1000,
verbose = F
)
Arguments
m |
A CsparseMatrix |
m2 |
A CsparseMatrix |
min_value |
Optionally, a numerical value, specifying the threshold for including a score in the output. |
max_value |
Optionally, a numerical value for the upper limit for including a score in the output. |
only_upper |
If true, only the upper triangle of the matrix is returned. Only possible for symmetrical output (m and m2 have same number of columns) |
diag |
If false, the diagonal of the matrix is not returned. Only possible for symmetrical output (m and m2 have same number of columns) |
top_n |
An integer specifying the number of strongest similarities to keep per row. So, for each row in m, at most top_n scores are returned. |
rowsum_div |
If true, divide the crossproduct by the column sums of m (this has to happen within the loop for min_value and top_n filtering). |
max_p |
A threshold for the maximum p value. |
pvalue |
If max_p < 1, edges are removed based on a p value. For each document (row) in m, a p value is calculated over its outward edges. The default is a p value based on the uniform distribution, akin to a "disparity" filter (see Serrano et al., DOI: 10.1073/pnas.0808904106) but without filtering on inward edges. |
normalize |
Normalize rows by a given norm score (before calculating similarity). Default is 'none' (no normalization). 'l2' is the l2 norm (use in combination with 'prod' crossfun for cosine similarity). 'softl2' is the adaptation of l2 for soft similarity (use in combination with 'softprod' crossfun for soft cosine). |
crossfun |
The function used in the vector operations. Normally this is "prod", for product (dot product). We also allow "min", for minimum value, which we use in our document overlap_pct score. In addition, there is the (experimental) "softprod", which can be used in combination with softl2 normalization to get the soft cosine similarity. "maxproduct" is a special case used in the query_lookup measure: it uses the product but only returns the score of the strongest matching term. "cp_lookup" and "cp_lookup_norm" are special cases for conditional probability sensitive lookup. |
group |
Optionally, a character vector that specifies a group (e.g., source) for each row in m. If given, only pairs of rows with the same group are calculated. |
group2 |
If m2 and group are used, group2 has to be used to specify the groups for the rows in m2 (otherwise group will be ignored) |
date |
Optionally, a POSIXct vector (or a vector that can be converted with as.POSIXct) that specifies a date for each row in m. If given, only pairs of rows within a given date range (see lwindow, rwindow and date_unit) are calculated. |
date2 |
If m2 and date are used, date2 has to be used to specify the date for the rows in m2 (otherwise date will be ignored) |
lwindow |
If date (and date2) are used, lwindow determines the left side of the date window. e.g. -10 means that rows are only matched with rows for which date is within 10 [date_units] before. |
rwindow |
Like lwindow, but for the right side. E.g., an lwindow of -1 and an rwindow of 1, with date_unit set to "days", means that rows are only matched if their dates are at most 1 day apart. |
date_unit |
The date unit used in lwindow and rwindow. Supports "days", "hours", "minutes" and "seconds". Note that this refers to the time distance between two rows ("days" doesn't refer to calendar days, but to a period of 24 hours). |
simmat |
If soft cosine similarity is used (see crossfun and normalize), a symmetric term-by-term matrix that indicates the similarity of terms (i.e. an adjacency matrix). If NULL, a cosine similarity matrix will be created on the fly. |
simmat_thres |
If soft cosine similarity is used, a threshold for the term similarity. |
row_attr |
If TRUE, add the "row_n" and "row_sum" elements to the "margin" attribute. |
col_attr |
Like row_attr, but adding "col_n" and "col_sum" to the "margin" attribute. |
lag_attr |
If TRUE, adds "lag_n" and "lag_sum" to the "margin" attribute. These are the margin scores for rows, where the date of the column is before (lag) the date of the row. Only possible if date argument is given. |
batchsize |
If group and/or date are used, size of batches. |
verbose |
If TRUE, report progress. |
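The date window arguments (date, lwindow, rwindow, date_unit) can be pictured with a small base R sketch. This is only an illustration of the rule as described above, not the package's internal code: the example dates are made up, and both the sign convention (date of the other row minus date of the current row) and the inclusive window boundaries are assumptions.

```r
# Hypothetical dates for four rows; not taken from the package
date <- as.POSIXct(c("2020-01-01 12:00:00", "2020-01-02 11:00:00",
                     "2020-01-02 13:00:00", "2020-01-05 00:00:00"),
                   tz = "UTC")
lwindow <- -1
rwindow <- 1

# Time distance in units of 24 hours ("days"): lag[i, j] = date[j] - date[i]
secs <- as.numeric(date)
lag <- outer(secs, secs, function(a, b) (b - a) / (60 * 60 * 24))

# Assumed rule: a pair is compared if its lag lies inside the window
in_window <- lag >= lwindow & lag <= rwindow
in_window
```

Rows 1 and 3 fall on consecutive calendar days but are 25 hours apart, so they are outside a one-day window, whereas rows 1 and 2 (23 hours apart) are inside it.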
Details
Enables limiting row combinations to within specified groups and date windows, and filters results that do not pass the threshold on the fly. To achieve this, options for similarity measures are included in the function. For example, to get the cosine similarity, you can normalize with "l2" and use the "prod" (product) function for the crossfun argument.
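The relation between l2 normalization, the product function and cosine similarity can be checked with the Matrix package alone. The sketch below uses a small made-up matrix and plain Matrix functions; it does not reproduce the internals of tcrossprod_sparse, and the guarded call at the end only shows the argument combination described in this section.

```r
library(Matrix)

# Small hypothetical matrix: rows are documents, columns are terms
m <- Matrix(c(1, 0, 2,
              0, 3, 0,
              1, 1, 0,
              2, 0, 4), nrow = 4, byrow = TRUE, sparse = TRUE)

# L2-normalize each row: divide by the square root of its sum of squares
row_norm <- sqrt(rowSums(m^2))
m_l2 <- Diagonal(x = 1 / row_norm) %*% m

# The dot product of L2-normalized rows is the cosine similarity
cos_sim <- as.matrix(tcrossprod(m_l2))
round(cos_sim, 3)

# Argument combination described in this section (only runs if RNewsflow is installed)
if (requireNamespace("RNewsflow", quietly = TRUE)) {
  print(RNewsflow::tcrossprod_sparse(m, normalize = "l2", crossfun = "prod"))
}
```

Rows 1 and 4 point in the same direction, so their cosine similarity is 1, while rows 1 and 2 share no terms and score 0.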
This function is called by the document comparison functions (newsflow_compare, delete_duplicates). We only expose it here for additional flexibility, and because it could be useful outside of the purpose of this package.
The output matrix also has an attribute "margin", which contains margin scores (e.g., row_sum) if the row_attr or col_attr arguments are used. The reason for including this is that some values that are normally available in the output of a cross product are broken if certain filter options are used. If group or date is used, we don't know how many columns a row has been compared to (normally this is all columns). If a min/max or top_n filter is used, we don't know the true row sums (and thus row means).
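Why filtering breaks the row sums can be seen in a small sketch. It emulates a min_value filter on an ordinary cross product using Matrix and base R; the matrix and the threshold are made up for illustration, and this is not how the package computes its "margin" attribute.

```r
library(Matrix)

# Small hypothetical matrix: rows are documents, columns are terms
m <- Matrix(c(1, 0, 2,
              0, 3, 0,
              1, 1, 0,
              2, 0, 4), nrow = 4, byrow = TRUE, sparse = TRUE)

# Unfiltered cross product and its true row sums
full <- as.matrix(tcrossprod(m))
true_row_sum <- rowSums(full)

# Emulate a min_value filter: scores below the threshold are dropped
min_value <- 5
filtered <- full
filtered[filtered < min_value] <- 0
filtered_row_sum <- rowSums(filtered)

# The sums computed from the filtered output no longer match the true sums
cbind(true_row_sum, filtered_row_sum)
```

Once scores have been dropped, the true row sums cannot be recovered from the filtered output alone, which is why they need to be recorded separately during the computation.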
Value
A CsparseMatrix
Examples
set.seed(1)
m = Matrix::rsparsematrix(5,10,0.5)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = TRUE)
tcrossprod_sparse(m, min_value = 0, only_upper = FALSE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0.2, only_upper = TRUE, diag = FALSE)
tcrossprod_sparse(m, min_value = 0, only_upper = TRUE, diag = FALSE, top_n = 1)
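To make the effect of top_n easier to read, the following base R sketch emulates keeping only the strongest score per row on a small made-up score matrix. The helper keep_top is hypothetical and not part of RNewsflow, and how the package itself handles ties is not covered here.

```r
# Hypothetical similarity scores (rows compared with columns)
scores <- matrix(c( 5, 0, 1, 10,
                    0, 9, 3,  0,
                    1, 3, 2,  2,
                   10, 0, 2, 20), nrow = 4, byrow = TRUE)

# Emulate diag = FALSE by ignoring self-similarities
diag(scores) <- 0

# Hypothetical helper: keep the n highest values of a vector, zero the rest
keep_top <- function(x, n) {
  out <- numeric(length(x))
  idx <- order(x, decreasing = TRUE)[seq_len(min(n, length(x)))]
  out[idx] <- x[idx]
  out
}

top_n <- 1
top <- t(apply(scores, 1, keep_top, n = top_n))
top
```

Each row keeps at most top_n non-zero scores, so the result is generally no longer symmetric even when the input is.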