R: Check for highly similar articles.

lnt_similarity {LexisNexisTools}

R Documentation

Check for highly similar articles.

Description

Check for highly similar articles by comparing all articles published on the same date. This function implements two measures to test if articles are almost identical. The function textstat_simil, which compares the word similarity of two given texts; and a relative modification of the generalized Levenshtein (edit) distance implementation in stringdist. The relative distance is calculated by dividing the string distance by the number of characters in the longer article (resulting in a minimum of 0 if articles are exactly alike and 1 if strings are completely different). Using both methods cancels out the disadvantages of each method: the similarity measure is fast but does not take the word order into account. Two widely different texts could, therefore, be identified as the same, if they employ the exact same vocabulary for some reason. The generalized Levenshtein distance is more accurate but is very computationally demanding, especially if more than two texts are compared at once.

Usage

lnt_similarity(
  texts,
  dates,
  LNToutput,
  IDs = NULL,
  threshold = 0.99,
  rel_dist = TRUE,
  length_diff = Inf,
  nthread = getOption("sd_num_thread"),
  max_length = Inf,
  verbose = TRUE
)

Arguments

`texts`	Provide texts to check for similarity.
`dates`	Provide corresponding dates, same length as `text`.
`LNToutput`	Alternatively to providing texts and dates individually, you can provide an LNToutput object.
`IDs`	IDs of articles.
`threshold`	At which threshold of similarity is an article considered a duplicate. Note that lower threshold values will increase the time to calculate the relative difference (as more articles are considered).
`rel_dist`	Calculate the relative Levenshtein distance between two articles if set to TRUE (can take very long). The main difference between the similarity and distance value is that the distance takes word order into account while similarity employs the bag of words approach.
`length_diff`	Before calculating the relative distance between articles, the length of the articles in characters is calculated. If the difference surpasses this value, calculation is omitted and the distance will set to NA.
`nthread`	Maximum number of threads to use (see stringdist-parallelization).
`max_length`	If the article is too long, calculation of the relative distance can cause R to crash (see https://github.com/markvanderloo/stringdist/issues/59). To prevent this you can set a maximum length (longer articles will not be evaluated).
`verbose`	A logical flag indicating whether information should be printed to the screen.

Value

A data.table consisting of information about duplicated articles. Articles with a lower similarity than the threshold will be removed, while all relative distances are still in the returned object. Before you use the duplicated information to subset your dataset, you should, therefore, filter out results with a high relative distance (e.g. larger than 0.2).

Author(s)

Johannes B. Gruber

Examples

## Not run: 
# Copy sample file to current wd
lnt_sample()

## End(Not run)

# Convert raw file to LNToutput object
LNToutput <- lnt_read(lnt_sample(copy = FALSE))

# Test similarity of articles
duplicates.df <- lnt_similarity(
  texts = LNToutput@articles$Article,
  dates = LNToutput@meta$Date,
  IDs = LNToutput@articles$ID
)

# Remove instances with a high relative distance
duplicates.df <- duplicates.df[duplicates.df$rel_dist < 0.2]

# Create three separate data.frames from cleaned LNToutput object
LNToutput <- LNToutput[!LNToutput@meta$ID %in%
  duplicates.df$ID_duplicate]
meta.df <- LNToutput@meta
articles.df <- LNToutput@articles
paragraphs.df <- LNToutput@paragraphs

[Package LexisNexisTools version 1.0.0 Index]