keywords_rake {udpipe}    R Documentation

Keyword identification using Rapid Automatic Keyword Extraction (RAKE)

Description

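RAKE is a basic algorithm which tries to identify keywords in text. Keywords are defined as a sequence of words following one another.
The algorithm goes as follows. Candidate keywords are extracted as contiguous sequences of words which do not contain irrelevant words. Each word which is part of a candidate keyword gets a score: the ratio of its degree (the number of co-occurrences with words in the candidate keywords) to its frequency (the number of times it occurs in the candidate keywords). The RAKE score of a candidate keyword is the sum of the scores of the words it consists of.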

The resulting keywords are returned as a data.frame together with their RAKE score.
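
As a rough illustration of the scoring idea from Rose et al. (2010), the base R sketch below splits a token sequence into candidate keywords at irrelevant words, scores each word as degree divided by frequency, and sums the word scores per candidate. It is a simplified sketch and not the implementation used by keywords_rake: the helper name rake_sketch and the toy input are hypothetical, and grouping, ngram_max and n_min are ignored.

```r
# Hypothetical helper: simplified RAKE scoring on a single token sequence.
# tokens:   character vector of words in reading order
# relevant: logical vector; FALSE marks words that split candidates (e.g. stopwords)
rake_sketch <- function(tokens, relevant) {
  stopifnot(length(tokens) == length(relevant))
  # Each run of consecutive relevant words forms one candidate keyword
  run_id <- cumsum(!relevant)
  candidates <- unname(split(tokens[relevant], run_id[relevant]))
  words <- unlist(candidates)
  # Word frequency: number of occurrences within the candidates
  freq <- table(words)
  # Word degree: summed length of the candidates a word occurs in
  # (the word itself is counted, following Rose et al. 2010)
  lens <- rep(lengths(candidates), lengths(candidates))
  degree <- tapply(lens, words, sum)
  score <- setNames(as.numeric(degree[names(freq)]) / as.numeric(freq),
                    names(freq))
  data.frame(
    keyword = vapply(candidates, paste, character(1), collapse = " "),
    ngram = lengths(candidates),
    rake = vapply(candidates, function(w) sum(score[w]), numeric(1)),
    stringsAsFactors = FALSE
  )
}

tokens <- c("compatibility", "of", "linear", "systems", "and",
            "linear", "constraints")
relevant <- !tokens %in% c("of", "and")
result <- rake_sketch(tokens, relevant)
print(result)
```

In this toy input "linear" occurs twice, each time in a two-word candidate, so its degree is 4 and its score is 2; words that only appear in longer candidates therefore tend to lift the score of multi-word keywords.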

Usage

keywords_rake(
  x,
  term,
  group,
  relevant = rep(TRUE, nrow(x)),
  ngram_max = 2,
  n_min = 2,
  sep = " "
)

Arguments

x

a data.frame with one row per term as returned by as.data.frame(udpipe_annotate(...))

term

character string with the name of a column in the data frame x, containing 1 term per row. To be used if x is a data.frame.

group

a character vector with the names of 1 or several columns from x which indicate, for example, a document id or a sentence id. Keywords are computed within each group, so that no keyword spans several sentences or documents.

relevant

a logical vector of the same length as nrow(x), indicating if the word in the corresponding row of x is relevant or not. This can be used to exclude stopwords from the keywords calculation or for selecting only nouns and adjectives to find keywords (for example based on the Parts of Speech upos output from udpipe_annotate).
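
A minimal sketch of two ways to build the relevant vector, assuming a hand-typed toy table whose columns only mimic udpipe output (doc_id, token, upos). The stopword list is invented for illustration, and the call to keywords_rake is only attempted when the udpipe package is installed.

```r
# Toy annotation table; typed by hand, not produced by udpipe_annotate
x <- data.frame(
  doc_id = "doc1",
  token  = c("The", "friendly", "host", "was", "very", "helpful"),
  upos   = c("DET", "ADJ", "NOUN", "AUX", "ADV", "ADJ"),
  stringsAsFactors = FALSE
)

# Option 1: keep only nouns and adjectives
relevant_pos <- x$upos %in% c("NOUN", "ADJ")

# Option 2: drop words found in a (hypothetical) stopword list
stopwords <- c("the", "was", "very")
relevant_stop <- !tolower(x$token) %in% stopwords

# The vector must have one element per row of x
stopifnot(length(relevant_pos) == nrow(x))

# Only call the package function when udpipe is available
if (requireNamespace("udpipe", quietly = TRUE)) {
  kw <- udpipe::keywords_rake(x, term = "token", group = "doc_id",
                              relevant = relevant_pos, n_min = 1)
  print(kw)
}
```

Either vector can be passed as relevant; n_min = 1 is used here only because the toy table is too small for any keyword to occur twice.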

ngram_max

integer indicating the maximum number of words that there should be in each keyword. Defaults to 2.

n_min

integer indicating the minimum number of times a keyword should occur in the data in order to be returned. Defaults to 2.

sep

character string with the separator which will be used to paste together the terms which define the keywords. Defaults to a space: ' '.

Value

a data.frame with columns keyword, ngram and rake which is ordered from low to high rake
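
Downstream code usually wants the highest-scoring keywords first, so it can be safer to sort the result explicitly than to rely on the returned order. The sketch below uses a hand-made stand-in for the returned data.frame; the keywords and scores are invented for illustration.

```r
# Hand-made stand-in for a keywords_rake() result; values are invented
keywords <- data.frame(
  keyword = c("city centre", "nice", "metro station"),
  ngram   = c(2L, 1L, 2L),
  rake    = c(3.5, 1.0, 4.0),
  stringsAsFactors = FALSE
)

# Sort explicitly from high to low RAKE score
top <- keywords[order(keywords$rake, decreasing = TRUE), ]

# Keep only multi-word keywords
top <- top[top$ngram > 1, ]
print(head(top, 10))
```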

References

Rose, Stuart & Engel, Dave & Cramer, Nick & Cowley, Wendy. (2010). Automatic Keyword Extraction from Individual Documents. Text Mining: Applications and Theory, pp. 1-20. doi:10.1002/9780470689646.ch1.

Examples

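library(udpipe)
data(brussels_reviews_anno)

## Dutch reviews: keywords per document, built from nouns and adjectives only
x <- subset(brussels_reviews_anno, language == "nl")
keywords <- keywords_rake(x = x, term = "lemma", group = "doc_id",
                          relevant = x$xpos %in% c("NN", "JJ"))
head(keywords)

## French reviews: keywords within each sentence of each document,
## allowing up to 10 words per keyword, joined with "-"
x <- subset(brussels_reviews_anno, language == "fr")
keywords <- keywords_rake(x = x, term = "lemma", group = c("doc_id", "sentence_id"),
                          relevant = x$xpos %in% c("NN", "JJ"),
                          ngram_max = 10, n_min = 2, sep = "-")
head(keywords)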

[Package udpipe version 0.8.11 Index]