delete_duplicates {RNewsflow} | R Documentation |
Delete duplicate (or similar) documents from a document term matrix
Description
Delete duplicate (or similar) documents from a document term matrix. Documents are treated as duplicates if they have high content similarity, occur within a given time distance of each other, and are published by the same source.
Usage
delete_duplicates(
  dtm,
  date_var = NULL,
  hour_window = c(-24, 24),
  group_var = NULL,
  measure = c("cosine", "overlap_pct"),
  similarity = 1,
  keep = "first",
  tf_idf = FALSE,
  dup_csv = NULL,
  verbose = F
)
Arguments
dtm |
A quanteda dfm. |
date_var |
The name of the column in docvars(dtm) that specifies the document date. The values should be of type POSIXlt or POSIXct. |
hour_window |
A vector of length 2, in which the first and second value determine the left and right side of the window, respectively. For example, c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours. |
group_var |
Optionally, column name in docvars(dtm) that specifies a group (e.g., source, sourcetype). If given, only documents within the same group will be compared. |
measure |
The measure that should be used to calculate similarity/distance/adjacency. Currently supports the symmetrical measure "cosine" (cosine similarity) and the asymmetrical measure "overlap_pct" (percentage of term scores in the document that also occur in the other document). |
similarity |
A threshold for similarity. Documents whose similarity is equal to or higher than this threshold are deleted. |
keep |
A character string indicating whether to keep the 'first' or 'last' published document of each set of duplicates. |
tf_idf |
If TRUE, weight the dtm with tf_idf before comparing documents. The original (non-weighted) DTM is returned. |
dup_csv |
Optionally, a path for writing a CSV file with the duplicates edgelist. For each duplicate pair, the file notes whether "from" or "to" is the duplicate, or whether "both" are duplicates (of other documents). |
verbose |
If TRUE, report progress |
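The difference between the symmetrical and asymmetrical measures can be illustrated with a small base R sketch. This is not the package's implementation: the helper names cosine_sim and overlap_share are hypothetical, the toy vectors are made up, and expressing the overlap as a proportion between 0 and 1 is an assumption based on the description of "overlap_pct" above.

```r
# Toy term-score vectors for two documents over the same vocabulary
# (hypothetical data, for illustration only).
doc_a <- c(economy = 2, election = 1, vote = 0, poll = 0)
doc_b <- c(economy = 2, election = 1, vote = 3, poll = 1)

# Cosine similarity: symmetrical, so the argument order does not matter.
cosine_sim <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Assumed reading of "overlap_pct": the share of the term scores in x
# that belong to terms also occurring in y. The result depends on
# which document is taken as x, so the measure is asymmetrical.
overlap_share <- function(x, y) {
  sum(x[y > 0]) / sum(x)
}

cosine_sim(doc_a, doc_b)     # same value as cosine_sim(doc_b, doc_a)
overlap_share(doc_a, doc_b)  # 1: every term score in doc_a also occurs in doc_b
overlap_share(doc_b, doc_a)  # 3/7: only part of doc_b is covered by doc_a
```

With an asymmetrical measure, a short document that is fully contained in a longer one can reach the threshold in one direction but not in the other.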
Details
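Note that this function can also be used to delete "updates" of articles (e.g., on news sites or from news agencies), because an update is usually very similar to the original article and published by the same source. Take this into account if the temporal order of publications is relevant for the analysis: the keep argument determines whether the first or the last published version is retained.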
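How the time window and the keep argument interact for an article and its update can be sketched in base R. This only illustrates the idea and is not the package's code: the document names and timestamps are made up, and treating the window boundaries as inclusive is an assumption.

```r
# Hypothetical publication times for three documents from one source.
dates <- as.POSIXct(
  c("2020-01-01 08:00:00", "2020-01-01 12:00:00", "2020-01-03 09:00:00"),
  tz = "UTC"
)
names(dates) <- c("original", "update", "later")

hour_window <- c(-24, 24)

# Hour difference of every document relative to "original".
diff_h <- as.numeric(difftime(dates, dates["original"], units = "hours"))

# Documents inside the window (boundaries assumed to be inclusive).
in_window <- diff_h >= hour_window[1] & diff_h <= hour_window[2]
candidates <- names(dates)[in_window]

# If "original" and "update" turn out to be similar enough, keep = 'first'
# retains the earliest one and keep = 'last' retains the latest one.
kept_first <- candidates[which.min(as.numeric(dates[candidates]))]
kept_last  <- candidates[which.max(as.numeric(dates[candidates]))]
```

Here "later" falls outside the 24-hour window and is never compared to "original", whatever its content.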
Value
A dtm with the duplicate documents deleted
Examples
library(RNewsflow)

## example with very low similarity threshold (normally not recommended!)
dtm2 = delete_duplicates(rnewsflow_dfm, similarity = 0.5, keep='first', tf_idf = TRUE)
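A call that also uses the time window and the grouping variable might look like the sketch below. It assumes that docvars(rnewsflow_dfm) contains columns named "date" and "source"; those column names are an assumption and are checked before the call. The threshold of 0.95 is an arbitrary illustrative value.

```r
# Run only if both packages are installed.
if (requireNamespace("RNewsflow", quietly = TRUE) &&
    requireNamespace("quanteda", quietly = TRUE)) {

  dfm <- RNewsflow::rnewsflow_dfm
  vars <- names(quanteda::docvars(dfm))

  # "date" and "source" are assumed column names; skip the call if absent.
  if (all(c("date", "source") %in% vars)) {
    deduped <- RNewsflow::delete_duplicates(
      dfm,
      date_var = "date",          # document date (POSIXct or POSIXlt)
      hour_window = c(-24, 24),   # compare documents at most 24 hours apart
      group_var = "source",       # only compare documents from the same source
      measure = "cosine",
      similarity = 0.95,          # illustrative threshold
      keep = "first"
    )

    # Number of documents before and after deduplication.
    print(c(before = quanteda::ndoc(dfm), after = quanteda::ndoc(deduped)))
  }
}
```

Comparing the document counts before and after is a quick check on whether the chosen threshold removes more documents than intended.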