textrank_candidates_lsh {textrank} | R Documentation |
Use locality-sensitive hashing to get combinations of sentences which contain words which are in the same minhash bucket
Description
This functionality is useful when there are many sentences and most of them share no words.
Instead of computing the Jaccard distance between all possible combinations of sentences, as
textrank_candidates_all does, the Minhash algorithm can be used to reduce the number of combinations.
This function returns the combinations of sentences which fall in the same Minhash bucket.
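The bucketing idea described above can be sketched in base R. This is only an illustration under stated assumptions, not the package's implementation: `toy_minhash_generator` and `candidate_pairs` are hypothetical helpers, and the toy hash stands in for a real minhash function such as the one used in the examples. Each sentence gets a minhash signature, the signature is cut into bands, and two sentences become a candidate pair when they agree on at least one whole band.

```r
# Hypothetical toy minhash: n random linear hashes, keeping the minimum
# hash value over the distinct words of a sentence.
toy_minhash_generator <- function(n, seed = 1) {
  set.seed(seed)
  a <- sample.int(1000L, n)
  b <- sample.int(1000L, n)
  function(words) {
    # Map each distinct word to a number (position-weighted sum of code points)
    codes <- vapply(unique(words), function(w) {
      cp <- utf8ToInt(w)
      as.numeric(sum(cp * seq_along(cp)))
    }, numeric(1))
    vapply(seq_len(n), function(i) min((a[i] * codes + b[i]) %% 10007), numeric(1))
  }
}

# Hypothetical helper: pairs of sentences sharing at least one band bucket
candidate_pairs <- function(x, sentence_id, minhashFUN, bands) {
  signatures <- lapply(split(x, sentence_id), minhashFUN)
  rows <- length(signatures[[1]]) / bands
  stopifnot(rows == floor(rows))
  band <- rep(seq_len(bands), each = rows)
  # One bucket key per sentence and band: band number plus the hashes in that band
  keys <- lapply(signatures, function(sig) {
    paste(seq_len(bands), tapply(sig, band, paste, collapse = "-"), sep = ":")
  })
  buckets <- data.frame(id = rep(names(keys), lengths(keys)),
                        key = unlist(keys, use.names = FALSE),
                        stringsAsFactors = FALSE)
  # Self-join on the bucket key, keeping each unordered pair once
  pairs <- merge(buckets, buckets, by = "key")
  pairs <- unique(pairs[pairs$id.x < pairs$id.y, c("id.x", "id.y")])
  rownames(pairs) <- NULL
  setNames(pairs, c("textrank_id_1", "textrank_id_2"))
}

words <- c("cat", "dog", "fish", "cat", "dog", "fish", "tax", "law", "court")
ids   <- c("s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3", "s3")
pairs <- candidate_pairs(words, ids, toy_minhash_generator(n = 20), bands = 10)
pairs
```

In this toy input, s1 and s2 contain the same words and therefore land in the same buckets, while s3 shares no word with them and is not paired with either. Only the returned pairs would then need a Jaccard distance computation.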
Usage
textrank_candidates_lsh(x, sentence_id, minhashFUN, bands)
Arguments
x |
a character vector of words or terms |
sentence_id |
a character vector of identifiers of the sentences to which the words/terms provided in x belong |
minhashFUN |
a function which returns a minhash of a character vector. See the examples, which create one with minhash_generator from the textreuse package |
bands |
integer indicating the number of bands into which the minhashes are broken down |
Value
a data.frame with 2 columns, textrank_id_1 and textrank_id_2, containing the identifiers (sentence_id)
of sentences which contained terms in the same minhash bucket.
This data.frame can be used as input in the textrank_sentences algorithm.
See Also
Examples
library(textrank)
library(textreuse)
library(udpipe)
# With 1000 hashes in 500 bands, a 10 percent Jaccard overlap will be detected well
lsh_probability(h = 1000, b = 500, s = 0.1)
# Minhash function producing 1000 hashes per character vector
minhash <- minhash_generator(n = 1000, seed = 123456789)
data(joboffer)
# One identifier per sentence across documents and paragraphs
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
# Keep only nouns and adjectives as terms
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
# Pairs of sentences which share a minhash bucket
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
head(candidates)
# Rank the sentences using only the candidate pairs
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)
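
The choice of 1000 hashes and 500 bands in the example can be checked by hand with the standard banding formula for locality-sensitive hashing, which I assume is what lsh_probability reports. With h hashes split into b bands of r = h / b rows, two sentences with Jaccard similarity s share at least one band with probability 1 - (1 - s^r)^b. The helper name `lsh_detection_probability` below is hypothetical and only illustrates the arithmetic.

```r
# Hypothetical helper: probability that two sets with Jaccard similarity s
# agree on at least one of b bands when h minhashes are used.
lsh_detection_probability <- function(h, b, s) {
  stopifnot(h %% b == 0)   # bands must divide the number of hashes evenly
  r <- h / b               # rows (hashes) per band
  1 - (1 - s^r)^b
}

p <- lsh_detection_probability(h = 1000, b = 500, s = 0.1)
p
```

Here r = 2, so each band matches with probability 0.1^2 = 0.01, and the chance that none of the 500 bands match is 0.99^500, which leaves a detection probability of about 0.993. Fewer bands with more rows each would make the filter stricter and miss more low-overlap pairs.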