newsflow_compare {RNewsflow} | R Documentation |
Create a network of document similarities over time
Description
This is a wrapper for the compare_documents function, specialised for the case of analyzing documents over time. The difference is that using date_var is mandatory, and the output is returned as an igraph network (using as_document_network).
Usage
newsflow_compare(
dtm,
dtm_y = NULL,
date_var = "date",
hour_window = c(-24, 24),
group_var = NULL,
measure = c("cosine", "overlap_pct", "overlap", "dot_product", "softcosine"),
tf_idf = FALSE,
min_similarity = 0,
n_topsim = NULL,
only_complete_window = TRUE,
...
)
Arguments
dtm |
A quanteda dfm. Note that it is common to first weight the dtm(s) before calculating document similarity. For this you can use quanteda's dfm_tfidf and dfm_weight. |
dtm_y |
Optionally, another dtm. If given, the documents in dtm will be compared to the documents in dtm_y. |
date_var |
The name of the column in meta that specifies the document date. The default is "date". The values should be of type POSIXct, or coercible with as.POSIXct. The hour_window argument uses these dates to only compare documents within a time window. |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. It is possible to specify time windows down to the level of seconds by using fractions of an hour (one second is 1 / 60 / 60 hours). |
group_var |
Optionally, the name of the column in meta that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared. |
measure |
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity), the asymmetrical measures "overlap_pct" (percentage of term scores in the document that also occur in the other document) and "overlap" (like overlap_pct, but as the sum of overlap instead of the percentage), and the symmetrical soft cosine measure "softcosine" (experimental). The regular dot product ("dot_product") is also supported. |
tf_idf |
If TRUE, weight the dtm (and dtm_y) by term frequency - inverse document frequency. For more control over weighting, we recommend using quanteda's dfm_tfidf or dfm_weight on dtm and dtm_y. |
min_similarity |
A threshold for similarity. Lower values are deleted. For all available similarity measures, zero means no similarity. |
n_topsim |
An alternative or additional type of threshold for similarity. Only keep the [n_topsim] highest similarity scores for x. Can return more than [n_topsim] similarity scores in the case of duplicate similarities. |
only_complete_window |
If TRUE, only compare articles (x) for which a full window of reference articles (y) is available. Thus, for the first and last [window.size] days, there will be no results for x. |
... |
Other arguments passed to |
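As an illustration of the hour_window argument, the following base R sketch shows which documents would fall inside a window of c(-10, 36) hours around one document, and how a window of seconds can be written as fractions of an hour. The dates are made up for this example, and the code only mimics the idea of the window; it is not the package's implementation.

```r
# Hypothetical publication times (not taken from the package's example data)
dates <- as.POSIXct(c("2020-01-01 00:00:00", "2020-01-01 06:00:00",
                      "2020-01-02 10:00:00", "2020-01-03 00:00:00"),
                    tz = "UTC")
hour_window <- c(-10, 36)

# Hour difference of every document relative to document 1
hourdiff <- as.numeric(difftime(dates, dates[1], units = "hours"))

# Documents inside the window around document 1, leaving out document 1 itself
# (leaving it out is a choice made for this sketch)
in_window <- hourdiff > hour_window[1] & hourdiff < hour_window[2]
in_window[1] <- FALSE
which(in_window)

# A window of 30 seconds on either side, expressed in hours
second <- 1 / 60 / 60
seconds_window <- c(-30, 30) * second
```

Here documents 2 and 3 (6 and 34 hours later) fall inside the window, while document 4 (48 hours later) does not.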
Value
An igraph network.
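Which document pairs end up as edges depends on min_similarity and n_topsim. The base R sketch below applies both thresholds to a small, made-up table of similarity scores, to show why ties can yield more than n_topsim results per document. The column names (from, to, weight) and the filtering steps are assumptions made for this illustration, not the package's internal code.

```r
# Hypothetical similarity scores between documents (from) and reference documents (to)
edges <- data.frame(
  from   = c("a", "a", "a", "a", "b", "b"),
  to     = c("p", "q", "r", "s", "p", "q"),
  weight = c(0.9, 0.5, 0.5, 0.1, 0.3, 0.05)
)
min_similarity <- 0.2
n_topsim <- 2

# Step 1: delete scores below the similarity threshold
kept <- edges[edges$weight >= min_similarity, ]

# Step 2: per document, keep the n_topsim highest scores.
# Tied scores share the same rank, so ties at the cutoff are all kept.
rank_in_doc <- ave(-kept$weight, kept$from,
                   FUN = function(w) rank(w, ties.method = "min"))
kept <- kept[rank_in_doc <= n_topsim, ]
kept
```

Document "a" keeps three scores (0.9 and the two tied values of 0.5) even though n_topsim is 2, and document "b" keeps only the single score that passed the threshold.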
Examples
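library(RNewsflow)
# Weight the example dfm by tf-idf, then compare each document to documents dated 0.1 to 36 hours later
dtm = quanteda::dfm_tfidf(rnewsflow_dfm)
el = newsflow_compare(dtm, date_var='date', hour_window = c(0.1, 36))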
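To make the difference between the symmetrical and asymmetrical options of the measure argument more concrete, here is a small base R sketch with two made-up term-weight vectors. The overlap calculations follow one reading of the argument description (the sum of a document's term scores for terms that also occur in the other document) and should be treated as an assumption rather than the package's exact implementation.

```r
# Hypothetical term weights for two documents over the same four terms
x <- c(apple = 2, banana = 1, cherry = 0, date = 1)
y <- c(apple = 1, banana = 0, cherry = 3, date = 1)

# Symmetrical measures: swapping x and y gives the same result
dot_product <- sum(x * y)
cosine <- dot_product / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# Asymmetrical measures (assumed reading): x's scores on terms that also occur in y
overlap_xy     <- sum(x[y > 0])
overlap_pct_xy <- overlap_xy / sum(x)

# The same measure from y's perspective gives a different value
overlap_yx     <- sum(y[x > 0])
overlap_pct_yx <- overlap_yx / sum(y)
```

With these vectors, overlap_pct_xy is 0.75 while overlap_pct_yx is 0.4, which is why the direction of the comparison matters for the overlap measures but not for the cosine.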