lexRank {lexRankr} | R Documentation |
Extractive text summarization with LexRank
Description
Compute LexRanks from a vector of documents using the page rank algorithm or degree centrality. The methods used to compute lexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."
Usage
lexRank(text, docId = "create", threshold = 0.2, n = 3,
returnTies = TRUE, usePageRank = TRUE, damping = 0.85,
continuous = FALSE, sentencesAsDocs = FALSE, removePunc = TRUE,
removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
rmStopWords = TRUE, Verbose = TRUE)
Arguments
text |
A character vector of documents to be cleaned and processed by the LexRank algorithm
|
docId |
A vector of document IDs with length equal to the length of text . If docId == "create" then doc IDs will be created as an index from 1 to n , where n is the length of text .
|
threshold |
The minimum similarity value a sentence pair must have to be represented in the graph where lexRank is calculated.
|
n |
The number of sentences to return as the extractive summary. The function will return the top n lexRanked sentences. See returnTies for handling ties in lexRank.
|
returnTies |
TRUE or FALSE indicating whether or not to return more than n sentence IDs if there is a tie in lexRank. If TRUE , the returned number of sentences will not be limited to n , but rather will return every sentence with a top n score. If FALSE , the returned number of sentences will be <=n . Defaults to TRUE .
|
usePageRank |
TRUE or FALSE indicating whether or not to use the page rank algorithm for ranking sentences. If FALSE , a sentence's unweighted degree centrality will be used as the rank. Defaults to TRUE .
|
damping |
The damping factor to be passed to the page rank algorithm. Ignored if usePageRank is FALSE .
|
continuous |
TRUE or FALSE indicating whether or not to use continuous LexRank. Only applies if usePageRank==TRUE . If TRUE , threshold will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to FALSE .
|
sentencesAsDocs |
TRUE or FALSE , indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If TRUE , inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).
|
removePunc |
TRUE or FALSE indicating whether or not to remove punctuation from text while tokenizing. If TRUE , punctuation will be removed. Defaults to TRUE .
|
removeNum |
TRUE or FALSE indicating whether or not to remove numbers from text while tokenizing. If TRUE , numbers will be removed. Defaults to TRUE .
|
toLower |
TRUE or FALSE indicating whether or not to coerce all of text to lowercase while tokenizing. If TRUE , text will be coerced to lowercase. Defaults to TRUE .
|
stemWords |
TRUE or FALSE indicating whether or not to stem resulting tokens. If TRUE , the outputted tokens will be stemmed using SnowballC::wordStem() . Defaults to TRUE .
|
rmStopWords |
TRUE , FALSE , or character vector of stopwords to remove from tokens. If TRUE , words in lexRankr::smart_stopwords will be removed prior to stemming. If FALSE , no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to TRUE .
|
Verbose |
TRUE or FALSE indicating whether or not to cat progress messages to the console while running. Defaults to TRUE .
|
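The interaction of threshold , usePageRank , damping and continuous can be sketched in base R. This is an illustration only, not the package's internal code: the similarity matrix is made up, pagerank() is a hypothetical helper written for this sketch, and treating the threshold as inclusive (>=) is an assumption.

```r
# Illustrative only: a made-up symmetric similarity matrix for four sentences.
sim <- matrix(c(1.00, 0.30, 0.10, 0.00,
                0.30, 1.00, 0.25, 0.05,
                0.10, 0.25, 1.00, 0.40,
                0.00, 0.05, 0.40, 1.00),
              nrow = 4, byrow = TRUE)
diag(sim) <- 0  # ignore self-similarity

threshold <- 0.2
damping   <- 0.85

# Unweighted graph: keep an edge only when similarity meets the threshold
# (inclusive comparison is an assumption of this sketch).
adj <- (sim >= threshold) * 1

# Degree centrality (the usePageRank = FALSE case): number of neighbours.
degree <- rowSums(adj)

# Hypothetical helper: PageRank by power iteration on a weight matrix.
pagerank <- function(w, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(w)
  out <- rowSums(w)
  # Row-normalise; rows with no edges spread their weight uniformly.
  p <- w / ifelse(out == 0, 1, out)
  p[out == 0, ] <- 1 / n
  r <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    r_new <- (1 - damping) / n + damping * as.vector(t(p) %*% r)
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  r_new
}

# usePageRank = TRUE, continuous = FALSE: rank on the thresholded graph.
pr_discrete <- pagerank(adj, damping)

# usePageRank = TRUE, continuous = TRUE: threshold is ignored and the
# similarities themselves act as edge weights.
pr_continuous <- pagerank(sim, damping)
```

In this toy graph the two middle sentences each have two neighbours above the threshold, so they outrank the two end sentences under both degree centrality and the thresholded page rank.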
Value
A 2 column dataframe with columns sentenceId and value . sentenceId contains the ids of the top n sentences in descending order by value . value contains the page rank score (if usePageRank==TRUE ) or degree centrality (if usePageRank==FALSE ).
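The effect of returnTies on the number of rows returned can be sketched as follows. The scores and sentence ids here are made up for illustration and are not output from lexRank() ; the tie rule shown (keep every row scoring at least as high as the n-th best) is one plausible reading of the argument description, not a statement of how the package implements it.

```r
# Made-up scores for illustration; not output from lexRank().
scores <- data.frame(sentenceId = c("s1", "s2", "s3", "s4", "s5"),  # hypothetical ids
                     value      = c(0.30, 0.20, 0.20, 0.20, 0.10),
                     stringsAsFactors = FALSE)
n <- 3

# Sort in descending order by value, as described for the returned dataframe.
ranked <- scores[order(-scores$value), ]

# returnTies = FALSE analogue: at most n rows.
top_no_ties <- head(ranked, n)

# returnTies = TRUE analogue: every row whose score is at least the
# n-th best score, so ties at the cutoff are all kept.
cutoff <- ranked$value[min(n, nrow(ranked))]
top_with_ties <- ranked[ranked$value >= cutoff, ]
```

Here three sentences share the score at the cutoff, so the tie-keeping version holds four rows while the strict version holds three.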
References
http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html
Examples
lexRank(c("This is a test.","Tests are fun.",
"Do you think the exam will be hard?","Is an exam the same as a test?",
"How many questions are going to be on the exam?"))
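A further hedged example, using only arguments listed under Usage: request the top two sentences without ties, using continuous LexRank and with progress messages turned off. The requireNamespace() guard is added for this sketch so the code does nothing when lexRankr is not installed; the variable names are arbitrary.

```r
# Same toy documents as the example above.
docs <- c("This is a test.", "Tests are fun.",
          "Do you think the exam will be hard?",
          "Is an exam the same as a test?",
          "How many questions are going to be on the exam?")

# Only run when the package is available (guard added for this sketch).
has_lexrankr <- requireNamespace("lexRankr", quietly = TRUE)

if (has_lexrankr) {
  # Top 2 sentences, no ties, weighted (continuous) graph, quiet console.
  top2 <- lexRankr::lexRank(docs, n = 2, returnTies = FALSE,
                            continuous = TRUE, Verbose = FALSE)
  print(top2)
}
```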
[Package lexRankr version 0.5.2 Index]