textrank_sentences {textrank} | R Documentation |
Textrank - extract relevant sentences
Description
The textrank algorithm is a technique to rank sentences in order of importance.
In order to find relevant sentences, the textrank algorithm needs 2 inputs:
a data.frame (data) with sentences and a data.frame (terminology)
containing tokens which are part of each sentence.
Based on these 2 datasets, it calculates the pairwise distance between each sentence by computing
how many terms are overlapping (Jaccard distance, implemented in textrank_jaccard).
These pairwise distances among the sentences are next passed on to Google's pagerank algorithm
to identify the most relevant sentences.
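The two steps described above can be illustrated on toy data. This is a minimal sketch, not the package's internal code: the `jaccard` helper and the `tokens` list are made up for illustration, and the ranking step assumes the igraph package is installed. Note that the overlap measure is larger when two sentences share more terms.

```r
# Jaccard overlap between two token vectors: |A n B| / |A u B|
# (illustrative helper, not the package's textrank_jaccard)
jaccard <- function(termsa, termsb) {
  length(intersect(termsa, termsb)) / length(union(termsa, termsb))
}

# Toy tokens per sentence (hypothetical data)
tokens <- list(
  s1 = c("data", "scientist", "team"),
  s2 = c("data", "team", "project"),
  s3 = c("office", "coffee")
)

# All unordered sentence pairs and their term overlap
pairs <- t(combn(names(tokens), 2))
edges <- data.frame(
  textrank_id_1 = pairs[, 1],
  textrank_id_2 = pairs[, 2],
  stringsAsFactors = FALSE
)
edges$weight <- mapply(
  function(a, b) jaccard(tokens[[a]], tokens[[b]]),
  edges$textrank_id_1, edges$textrank_id_2,
  USE.NAMES = FALSE
)
edges

# Rank sentences with pagerank on the weighted, undirected sentence graph
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(
    edges[edges$weight > 0, ],
    directed = FALSE,
    vertices = data.frame(name = names(tokens))
  )
  pr <- igraph::page_rank(g, directed = FALSE)
  sort(pr$vector, decreasing = TRUE)
}
```

Sentences that share terms with other sentences end up with a higher score than the isolated one.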
If data contains many sentences, it makes sense not to compute all pairwise sentence distances but instead to limit
the calculation of the Jaccard distance to only those sentence combinations selected by the Minhash algorithm.
This is implemented in textrank_candidates_lsh and an example is shown below.
Usage
textrank_sentences(
  data,
  terminology,
  textrank_dist = textrank_jaccard,
  textrank_candidates = textrank_candidates_all(data$textrank_id),
  max = 1000,
  options_pagerank = list(directed = FALSE),
  ...
)
Arguments
data |
a data.frame with 1 row per sentence where the first column is an identifier of a sentence (e.g. textrank_id) and the second column is the raw sentence. See the example. |
terminology |
a data.frame with one row per token indicating which token is part of each sentence.
The first column in this data.frame is the identifier which corresponds to the first column of data |
textrank_dist |
a function which calculates the distance between 2 sentences which are represented by vectors of tokens.
The first 2 arguments of the function are the tokens in sentence1 and sentence2.
The function should return a numeric value of length one. The larger the value,
the stronger the connection between the 2 vectors. Defaults to the Jaccard distance (textrank_jaccard) |
textrank_candidates |
a data.frame of candidate sentence to sentence comparisons with columns textrank_id_1 and textrank_id_2
indicating for which combinations of sentences we want to compute the Jaccard distance or the distance function as provided in textrank_dist |
max |
integer giving the maximum number of sentence to sentence combinations to compute.
If provided, only this maximum number of rows is taken from textrank_candidates |
options_pagerank |
a list of arguments passed on to page_rank |
... |
arguments passed on to |
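As an illustration of the textrank_dist contract described above (two token vectors in, one numeric value out, larger meaning a stronger connection), here is a minimal sketch of a custom function. The name `overlap_coefficient` and the choice of measure are assumptions made for this example, not part of the package.

```r
# Hypothetical alternative to textrank_jaccard: the overlap coefficient
# |A n B| / min(|A|, |B|). Larger values mean a stronger connection.
overlap_coefficient <- function(termsa, termsb) {
  termsa <- unique(termsa)
  termsb <- unique(termsb)
  smallest <- min(length(termsa), length(termsb))
  if (smallest == 0) {
    return(0)  # no tokens, no connection
  }
  length(intersect(termsa, termsb)) / smallest
}

overlap_coefficient(c("data", "team"), c("data", "team", "project"))

## Assumed usage, with sentences and terminology built as in the Examples:
## tr <- textrank_sentences(data = sentences, terminology = terminology,
##                          textrank_dist = overlap_coefficient)
```

Unlike the Jaccard measure, this variant does not penalise a short sentence for being compared with a long one, which may or may not be desirable depending on the text.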
Value
an object of class textrank_sentences which is a list with elements:
sentences: a data.frame with columns textrank_id, sentence and textrank where the textrank is the Google Pagerank importance metric of the sentence
sentences_dist: a data.frame with columns textrank_id_1, textrank_id_2 (the sentence id) and weight which is the result of the computed distance between the 2 sentences
pagerank: the result of a call to page_rank
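To show how the sentences element might be used directly, the sketch below builds a mock list with the documented columns and picks the highest-ranked sentences. The object `tr_mock`, its values and the helper `top_sentences` are all made up for illustration; the helper also assumes that sorting by textrank_id restores the original sentence order.

```r
# Mock object mirroring the documented structure; all values are made up.
tr_mock <- list(
  sentences = data.frame(
    textrank_id = 1:3,
    sentence = c("First sentence.", "Second sentence.", "Third sentence."),
    textrank = c(0.2, 0.5, 0.3),
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: the n highest-ranked sentences,
# optionally returned in the order of textrank_id
top_sentences <- function(x, n = 2, keep_order = FALSE) {
  s <- x$sentences
  s <- head(s[order(s$textrank, decreasing = TRUE), ], n)
  if (keep_order) {
    s <- s[order(s$textrank_id), ]
  }
  s$sentence
}

top_sentences(tr_mock, n = 2)
top_sentences(tr_mock, n = 2, keep_order = TRUE)
```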
See Also
page_rank, textrank_candidates_all, textrank_candidates_lsh, textrank_jaccard
Examples
library(udpipe)
data(joboffer)
head(joboffer)
joboffer$textrank_id <- unique_identifier(joboffer, c("doc_id", "paragraph_id", "sentence_id"))
sentences <- unique(joboffer[, c("textrank_id", "sentence")])
cat(sentences$sentence)
terminology <- subset(joboffer, upos %in% c("NOUN", "ADJ"), select = c("textrank_id", "lemma"))
head(terminology)
## Textrank for finding the most relevant sentences
tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 2)
summary(tr, n = 5, keep.sentence.order = TRUE)
## Not run:
## Using minhash to reduce sentence combinations - relevant if you have a lot of sentences
library(textreuse)
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma, sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash, bands = 500)
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
summary(tr, n = 2)
## End(Not run)
## You can also reduce the number of sentence combinations by sampling
tr <- textrank_sentences(data = sentences, terminology = terminology, max = 100)
tr
summary(tr, n = 2)