textrank_candidates_lsh {textrank}

R Documentation

Use locality-sensitive hashing to get combinations of sentences which contain words which are in the same minhash bucket

Description

This functionality is useful if there are many sentences and most pairs of sentences have no words in common. Instead of computing the Jaccard distance among all possible combinations of sentences, as is done when using textrank_candidates_all, we can reduce the number of combinations by using the Minhash algorithm. This function sets up the combinations of sentences which are in the same Minhash bucket.
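To make the bucketing idea concrete, here is a minimal base R sketch of minhashing followed by banding. It is not the implementation used by this package: the toy data, the permutation-based hashes and the names `words`, `perms`, `signature` and `keys` are assumptions made for illustration only.

```r
# Toy data in the same shape as the x / sentence_id arguments:
# one element per (sentence, word) combination.
words       <- c("data", "science", "job", "data", "science", "team", "coffee", "machine")
sentence_id <- c("s1", "s1", "s1", "s2", "s2", "s2", "s3", "s3")

set.seed(42)
vocab <- unique(words)
h <- 20      # number of minhash values per sentence (illustrative choice)
b <- 10      # number of bands; h must be a multiple of b
r <- h / b   # number of minhash values per band

# Each column of perms is a random permutation of the vocabulary:
# perms[j, i] is the rank of word j under hash i.
perms <- replicate(h, sample(length(vocab)))

# Minhash signature: for every hash, the smallest rank among the words of a sentence.
by_sentence <- split(match(words, vocab), sentence_id)
signature <- sapply(by_sentence, function(idx) {
  apply(perms[idx, , drop = FALSE], 2, min)
})

# Banding: paste the r values of each band into one bucket key per sentence.
band <- rep(seq_len(b), each = r)
keys <- apply(signature, 2, function(sig) tapply(sig, band, paste, collapse = "-"))

# Two sentences become a candidate pair if they share a bucket key in at least one band.
pairs <- t(combn(colnames(signature), 2))
is_candidate <- apply(pairs, 1, function(p) any(keys[, p[1]] == keys[, p[2]]))
candidates <- data.frame(textrank_id_1 = pairs[is_candidate, 1],
                         textrank_id_2 = pairs[is_candidate, 2])
candidates
```

In this sketch sentence `s3` shares no word with the other two sentences, so its minhash values can never match theirs and it never appears in a candidate pair. Whether `s1` and `s2` end up as a pair depends on the random permutations and on the choice of `h` and `b`.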

Usage

textrank_candidates_lsh(x, sentence_id, minhashFUN, bands)

Arguments

`x`	a character vector of words or terms
`sentence_id`	a character vector of sentence identifiers indicating, for each word/term in `x`, the sentence it is part of. `sentence_id` should have the same length as `x`
`minhashFUN`	a function which returns a minhash of a character vector. See the examples or look at `minhash_generator`
`bands`	integer indicating the number of bands into which the minhashes are broken down. Note that the number of minhash signatures should always be a multiple of the number of locality-sensitive hashing bands. See the example
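The interplay between the number of minhashes and `bands` can be sketched with the standard locality-sensitive hashing formula: with h minhashes split into b bands of r = h / b values each, two sentences with Jaccard similarity s share at least one bucket with probability 1 - (1 - s^r)^b. The helper `lsh_detect_prob` below is a hypothetical name introduced for this illustration and is not part of this package; it should give the same kind of number as `lsh_probability` from textreuse, which is used in the Examples.

```r
# Hypothetical helper (not part of textrank): probability that a pair of
# sentences with Jaccard similarity s falls in the same bucket in at least one band.
lsh_detect_prob <- function(h, b, s) {
  # the number of minhashes must be a multiple of the number of bands
  stopifnot(h %% b == 0)
  r <- h / b            # minhash values per band
  1 - (1 - s^r)^b
}

# Settings used in the Examples: 1000 minhashes in 500 bands
lsh_detect_prob(h = 1000, b = 500, s = 0.1)

# Fewer bands means longer bands, which makes low-overlap pairs harder to detect
lsh_detect_prob(h = 1000, b = 100, s = 0.1)

# A number of bands that does not divide the number of minhashes is rejected
inherits(try(lsh_detect_prob(h = 1000, b = 300, s = 0.1), silent = TRUE), "try-error")
```

More bands make the candidate selection more permissive, at the cost of more sentence pairs for which the distance has to be computed afterwards.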

Value

a data.frame with 2 columns, textrank_id_1 and textrank_id_2, containing the identifiers (sentence_id) of pairs of sentences which have terms in the same minhash bucket. This data.frame can be used as input to the textrank_sentences algorithm.

Examples


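library(textrank)   # provides textrank_candidates_lsh, textrank_sentences and the joboffer data
library(textreuse)  # provides minhash_generator and lsh_probability
library(udpipe)     # provides unique_identifier
# With 1000 minhashes in 500 bands, a pair of sentences with a Jaccard similarity
# of 0.1 is detected with a probability of about 0.99
lsh_probability(h = 1000, b = 500, s = 0.1)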

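# Minhash function producing 1000 hashes, a multiple of the 500 bands used below
minhash <- minhash_generator(n = 1000, seed = 123456789)

data(joboffer)
# One identifier per sentence across documents and paragraphs
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
# Keep only nouns and adjectives as the terms used to compare sentences
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
# Candidate pairs: sentences which have terms in the same minhash bucket
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
head(candidates)
# Rank the sentences using only the candidate pairs found above
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)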