R: Extractive text summarization with LexRank

lexRank {lexRankr}

R Documentation

Extractive text summarization with LexRank

Description

Compute LexRanks from a vector of documents using the PageRank algorithm or degree centrality. The methods used to compute LexRank are discussed in "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization."
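The two ranking options can be sketched in base R. This is an illustration of the general idea only, not the code `lexRankr` runs: it assumes plain term-frequency cosine similarity (the paper uses an idf-modified cosine), skips stemming and stopword removal, and all variable names are hypothetical.

```r
# Illustrative sketch only: plain term-frequency cosine similarity stands in
# for the idf-modified cosine described in the LexRank paper.
sentences <- c("This is a test.",
               "Tests are fun.",
               "Is an exam the same as a test?",
               "How many questions are on the exam?")

# Tokenize: lowercase, strip punctuation, split on whitespace.
tokens <- strsplit(gsub("[[:punct:]]", "", tolower(sentences)), "\\s+")
vocab  <- unique(unlist(tokens))

# Sentence-by-term count matrix (one row per sentence).
tf <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))

# Cosine similarity between every pair of sentences.
norms <- sqrt(rowSums(tf^2))
sim   <- (tf %*% t(tf)) / (norms %o% norms)

# Keep an edge only where similarity meets the threshold; drop self-loops.
threshold <- 0.2
adj <- (sim >= threshold) * 1
diag(adj) <- 0

# Option 1: degree centrality is the number of neighbours of each sentence.
degree <- rowSums(adj)

# Option 2: PageRank by power iteration on the row-normalized graph.
damping <- 0.85
n_sent  <- nrow(adj)
trans   <- adj / pmax(rowSums(adj), 1)     # divide each row by its degree
trans[rowSums(adj) == 0, ] <- 1 / n_sent   # isolated sentences jump uniformly
rank <- rep(1 / n_sent, n_sent)
for (i in 1:100) {
  rank <- (1 - damping) / n_sent + damping * as.vector(t(trans) %*% rank)
}
```

Sorting `degree` or `rank` in decreasing order and keeping the first few entries gives the extractive summary under each option.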

Usage

lexRank(text, docId = "create", threshold = 0.2, n = 3,
  returnTies = TRUE, usePageRank = TRUE, damping = 0.85,
  continuous = FALSE, sentencesAsDocs = FALSE, removePunc = TRUE,
  removeNum = TRUE, toLower = TRUE, stemWords = TRUE,
  rmStopWords = TRUE, Verbose = TRUE)

Arguments

`text`	A character vector of documents to be cleaned and processed by the LexRank algorithm
`docId`	A vector of document IDs with length equal to the length of `text`. If `docId == "create"` then doc IDs will be created as an index from 1 to `length(text)`.
`threshold`	The minimum similarity value a sentence pair must have to be represented in the graph from which LexRank is calculated.
`n`	The number of sentences to return as the extractive summary. The function will return the top `n` lexRanked sentences. See `returnTies` for handling ties in lexRank.
`returnTies`	`TRUE` or `FALSE` indicating whether or not to return more than `n` sentence IDs if there is a tie in lexRank. If `TRUE`, the returned number of sentences will not be limited to `n`, but rather will include every sentence with a top `n` score. If `FALSE`, the returned number of sentences will be `<=n`. Defaults to `TRUE`.
`usePageRank`	`TRUE` or `FALSE` indicating whether or not to use the PageRank algorithm for ranking sentences. If `FALSE`, a sentence's unweighted degree centrality will be used as the rank. Defaults to `TRUE`.
`damping`	The damping factor to be passed to page rank algorithm. Ignored if `usePageRank` is `FALSE`.
`continuous`	`TRUE` or `FALSE` indicating whether or not to use continuous LexRank. Only applies if `usePageRank==TRUE`. If `TRUE`, `threshold` will be ignored and lexRank will be computed using a weighted graph representation of the sentences. Defaults to `FALSE`.
`sentencesAsDocs`	`TRUE` or `FALSE`, indicating whether or not to treat sentences as documents when calculating tfidf scores for similarity. If `TRUE`, inverse document frequency will be calculated as inverse sentence frequency (useful for single document extractive summarization).
`removePunc`	`TRUE` or `FALSE` indicating whether or not to remove punctuation from text while tokenizing. If `TRUE`, punctuation will be removed. Defaults to `TRUE`.
`removeNum`	`TRUE` or `FALSE` indicating whether or not to remove numbers from text while tokenizing. If `TRUE`, numbers will be removed. Defaults to `TRUE`.
`toLower`	`TRUE` or `FALSE` indicating whether or not to coerce all of text to lowercase while tokenizing. If `TRUE`, `text` will be coerced to lowercase. Defaults to `TRUE`.
`stemWords`	`TRUE` or `FALSE` indicating whether or not to stem resulting tokens. If `TRUE`, the resulting tokens will be stemmed using `SnowballC::wordStem()`. Defaults to `TRUE`.
`rmStopWords`	`TRUE`, `FALSE`, or character vector of stopwords to remove from tokens. If `TRUE`, words in `lexRankr::smart_stopwords` will be removed prior to stemming. If `FALSE`, no stopword removal will occur. If a character vector is passed, this vector will be used as the list of stopwords to be removed. Defaults to `TRUE`.
`Verbose`	`TRUE` or `FALSE` indicating whether or not to `cat` progress messages to the console while running. Defaults to `TRUE`.
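The interaction between `n` and `returnTies` can be illustrated with a few lines of base R. This is a sketch of the behavior described above, not the package's internal code; the scores and names are made up, and which of several tied sentences survives when ties are dropped depends on the package's internals.

```r
# Hypothetical scores for five sentences; s2 and s3 tie for third place.
scores <- c(s1 = 0.30, s2 = 0.20, s3 = 0.20, s4 = 0.25, s5 = 0.05)
n <- 3

ranked <- sort(scores, decreasing = TRUE)

# Ties kept: every sentence scoring at least as high as the n-th best.
with_ties <- ranked[ranked >= ranked[n]]

# Ties dropped: exactly the first n entries of the ordering.
without_ties <- head(ranked, n)
```

Here `with_ties` holds more than `n` sentences because two of them share the third-best score, while `without_ties` is capped at `n`.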

Value

A 2 column dataframe with columns `sentenceId` and `value`. `sentenceId` contains the ids of the top `n` sentences in descending order by `value`. `value` contains the PageRank score (if `usePageRank==TRUE`) or degree centrality (if `usePageRank==FALSE`).
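Because the result holds only IDs and scores, a common next step is to recover the sentence text. The sketch below assumes that `lexRankr::sentenceParse()` assigns the same `sentenceId` values that `lexRank()` reports and returns them alongside a `sentence` column; check both against the installed package version before relying on it.

```r
library(lexRankr)

docs <- c("This is a test.", "Tests are fun.",
          "Do you think the exam will be hard?",
          "Is an exam the same as a test?",
          "How many questions are going to be on the exam?")

# Top two sentences, without progress messages.
top <- lexRank(docs, n = 2, Verbose = FALSE)

# Split the same input into sentences to recover the text behind each ID.
# Assumption: sentenceParse() labels sentences with the same IDs as lexRank().
sentences <- sentenceParse(docs)

# Attach the text to each ranked ID and sort from highest to lowest score.
summary_df <- merge(top, sentences, by = "sentenceId")
summary_df <- summary_df[order(summary_df$value, decreasing = TRUE), ]
summary_df
```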

References

http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume22/erkan04a-html/erkan04a.html

Examples

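library(lexRankr)

# Rank five one-sentence documents and return the top-ranked sentences (default n = 3)
lexRank(c("This is a test.", "Tests are fun.",
          "Do you think the exam will be hard?", "Is an exam the same as a test?",
          "How many questions are going to be on the exam?"))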

[Package lexRankr version 0.5.2 Index]