compare_documents {corpustools}

R Documentation

Calculate the similarity of documents

Description

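Calculate pairwise similarity scores between the documents in a tCorpus. Comparisons can optionally be restricted to documents within a time window, documents with identical meta values, or specific subsets of documents.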

Usage

compare_documents(
  tc,
  feature = "token",
  date_col = NULL,
  meta_cols = NULL,
  hour_window = c(24),
  measure = c("cosine", "overlap_pct"),
  min_similarity = 0,
  weight = c("norm_tfidf", "tfidf", "termfreq", "docfreq"),
  ngrams = NA,
  from_subset = NULL,
  to_subset = NULL,
  return_igraph = T,
  verbose = T
)

Arguments

`tc`	A tCorpus
`feature`	the column name of the feature that is to be used for the comparison.
`date_col`	the name of a column in the meta data that contains dates with time, in POSIXct format. If given together with hour_window, only documents within the given hour_window will be compared.
`meta_cols`	a character vector with columns in the meta data / docvars. If given, only documents for which these values are identical are compared
`hour_window`	A vector of length 1 or 2. If length is 1, the same value is used for the left and right side of the window. If length is 2, the first and second value determine the left and right side. For example, the value 12 will compare each document to all documents between the previous and next 12 hours, and c(-10, 36) will compare each document to all documents between the previous 10 and the next 36 hours.
`measure`	the similarity measure. Currently supports cosine similarity (symmetric) and overlap_pct (asymmetric)
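`min_similarity`	A minimum threshold for the similarity score. Document pairs that score below this value are not included in the output.
`weight`	a weighting scheme for the document-term matrix. The default, "norm_tfidf", is term frequency-inverse document frequency (tf-idf) with rows normalized for document length.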
`ngrams`	an integer. If given, ngrams of this length are used
`from_subset`	An expression to select a subset. If given, only this subset will be compared to other documents
`to_subset`	An expression to select a subset. If given, documents are only compared to this subset
`return_igraph`	If TRUE, return as an igraph network. Otherwise, return as a list with the edgelist and meta data.
`verbose`	If TRUE, report progress
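
The hour_window rules described above can be sketched in base R. This is only an illustration of the window semantics, not the package's implementation: the helpers `expand_window` and `in_window` are hypothetical names, and treating the window boundaries as inclusive is an assumption.

```r
# Hypothetical helper (not part of corpustools): expand hour_window into a
# c(left, right) pair, following the length-1 / length-2 rule described above.
expand_window <- function(hour_window) {
  stopifnot(length(hour_window) %in% c(1, 2))
  if (length(hour_window) == 1) {
    c(-abs(hour_window), abs(hour_window))
  } else {
    hour_window
  }
}

# Hypothetical helper: which of `dates` fall inside the window around `ref`?
# ASSUMPTION: both window boundaries are inclusive.
in_window <- function(ref, dates, hour_window) {
  w <- expand_window(hour_window)
  diff_hours <- as.numeric(difftime(dates, ref, units = "hours"))
  diff_hours >= w[1] & diff_hours <= w[2]
}

dates <- as.POSIXct(c("2010-01-01 00:00", "2010-01-01 20:00", "2010-01-03 00:00"),
                    tz = "UTC")

# Window of 12: from 12 hours before to 12 hours after the first document
in_window(dates[1], dates, 12)

# Window of c(-10, 36): from 10 hours before to 36 hours after
in_window(dates[1], dates, c(-10, 36))
```

In this toy data the second document is published 20 hours after the first, so it falls outside a window of 12 but inside c(-10, 36).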

Value

If return_igraph is TRUE (the default), an igraph graph in which nodes are documents and edges represent similarity scores. Otherwise, a list with the edgelist and meta data.
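
To illustrate why a symmetric measure gives one score per document pair while an asymmetric measure can give a different score in each direction, here is a minimal base R sketch on raw term counts. The `cosine` and `overlap_pct` functions below are hypothetical stand-ins, not the package's code. In particular, the overlap definition used here (the share of one document's total term weight that falls on terms also present in the other) is an assumption, as is reporting it as a fraction. The values will also differ from what compare_documents returns, since that applies the chosen weight scheme first.

```r
# Toy term-count vectors over the vocabulary a..k for two documents
# with the texts 'a b c d e' and 'a b c'.
vocab <- letters[1:11]
doc1 <- as.numeric(vocab %in% c("a", "b", "c", "d", "e"))
doc3 <- as.numeric(vocab %in% c("a", "b", "c"))

# Cosine similarity: the same value in both directions.
cosine <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# ASSUMPTION: overlap is the share of x's total term weight that falls on
# terms that also occur in y. The result depends on the direction.
overlap_pct <- function(x, y) {
  sum(x[y > 0]) / sum(x)
}

cosine(doc1, doc3)
cosine(doc3, doc1)

overlap_pct(doc1, doc3)  # 3 of the 5 terms in doc1 also occur in doc3
overlap_pct(doc3, doc1)  # all 3 terms in doc3 also occur in doc1
```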

Examples

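library(corpustools)

## three short documents; the third is published two years after the others
d = data.frame(text = c('a b c d e',
                        'e f g h i j k',
                        'a b c'),
               date = as.POSIXct(c('2010-01-01','2010-01-01','2012-01-01')))
tc = create_tcorpus(d)

## default: cosine similarity, returned as an igraph graph
g = compare_documents(tc)
igraph::as_data_frame(g)

## asymmetric measure
g = compare_documents(tc, measure = 'overlap_pct')
igraph::as_data_frame(g)

## only compare each document to documents published up to 36 hours later
g = compare_documents(tc, date_col = 'date', hour_window = c(0,36))
igraph::as_data_frame(g)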
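
Two further arguments from the Usage section, ngrams and return_igraph, could be tried along the same lines. This is a sketch that assumes corpustools and igraph are installed. It only uses arguments listed under Usage, and it inspects the non-igraph result with str() rather than assuming the names of the list elements.

```r
library(corpustools)

d = data.frame(text = c('a b c d e',
                        'e f g h i j k',
                        'a b c'),
               date = as.POSIXct(c('2010-01-01','2010-01-01','2012-01-01')))
tc = create_tcorpus(d)

## compare documents using ngrams of length 2
g_bigram = compare_documents(tc, ngrams = 2, verbose = FALSE)
igraph::as_data_frame(g_bigram)

## return a list with the edgelist and meta data instead of an igraph graph
res = compare_documents(tc, return_igraph = FALSE, verbose = FALSE)
str(res, max.level = 1)
```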

[Package corpustools version 0.5.1 Index]