doc_similarity {text2map} | R Documentation |
Find a similarities between documents
Description
Given a document-term matrix (DTM) this function returns the similarities between documents using a specified method (see details). The result is a square document-by-document similarity matrix (DSM), equivalent to a weighted adjacency matrix in network analysis.
Usage
doc_similarity(x, y = NULL, method, wv = NULL)
Arguments
x |
Document-term matrix with terms as columns. |
y |
Optional second matrix (default = |
method |
Character vector indicating similarity method, including projection, cosine, wmd, and centroid (see Details). |
wv |
Matrix of word embedding vectors (a.k.a embedding model) with rows as words. Required for "wmd" and "centroid" similarities. |
Details
Document similarity methods include:
projection: finds the one-mode projection matrix from the two-mode DTM using
tcrossprod()
which measures the shared vocabulary overlapcosine: compares row vectors using cosine similarity
jaccard: compares proportion of common words to unique words in both documents
wmd: word mover's distance to compare documents (requires word embedding vectors), using linear-complexity relaxed word mover's distance
centroid: represents each document as a centroid of their respective vocabulary, then uses cosine similarity to compare centroid vectors (requires word embedding vectors)
Author(s)
Dustin Stoltz
Examples
# load example word embeddings
data(ft_wv_sample)
# load example text
data(jfk_speech)
# minimal preprocessing
jfk_speech$sentence <- tolower(jfk_speech$sentence)
jfk_speech$sentence <- gsub("[[:punct:]]+", " ", jfk_speech$sentence)
# create DTM
dtm <- dtm_builder(jfk_speech, sentence, sentence_id)
dsm_prj <- doc_similarity(dtm, method = "projection")
dsm_cos <- doc_similarity(dtm, method = "cosine")
dsm_wmd <- doc_similarity(dtm, method = "wmd", wv = ft_wv_sample)
dsm_cen <- doc_similarity(dtm, method = "centroid", wv = ft_wv_sample)