R: Extract keywords from raw text

keyword_extract {akc}

R Documentation

Extract keywords from raw text

Description

When only raw text such as abstracts or full articles is available and no keywords are supplied, keywords can be extracted from the text first. The minimum input is a data.frame with a document ID column and a raw text column; a user-defined dictionary should also be provided. The make_dict function builds such a dictionary from a character vector of vocabulary terms. If no dictionary is provided, the function returns all n-gram tokens without filtering (not recommended).

Usage

keyword_extract(
  dt,
  id = "id",
  text,
  dict = NULL,
  stopword = NULL,
  n_max = 4,
  n_min = 1
)

Arguments

`dt`	A data.frame containing at least two columns with document ID and text strings for extraction.
`id`	Quoted characters specifying the column name of the document ID. Default is "id".
`text`	Quoted characters specifying the column name of raw text for extraction.
`dict`	A data.table with two columns, namely "id" and "keyword" (set as key). This should be created by the `make_dict` function. The default is `NULL`, which means the output keywords are not filtered by a dictionary (usually not recommended).
`stopword`	A vector containing the stop words to be used. Default uses `NULL`.
`n_max`	The maximum number of words in an n-gram. This must be an integer greater than or equal to 1. Default is 4.
`n_min`	The minimum number of words in an n-gram. This must be an integer greater than or equal to 1, and less than or equal to n_max. Default is 1.

Details

In akc, keyword extraction proceeds in three steps. First, the raw text is split into independent clauses at the punctuation marks [,;!?.]. Then the n-grams of each clause are extracted. Finally, only the phrases (n-grams) that appear in the user's dictionary (created with make_dict) are kept. The user can also set the range of n for the n-grams through n_min and n_max.
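
The three steps can be illustrated with a minimal base R sketch for a single document. This is not the akc implementation (which relies on data.table); the helper names `clause_ngrams` and `extract_sketch` are hypothetical, the dictionary is assumed to be a plain character vector rather than the output of make_dict, and stop word handling is left out.

```r
# Hypothetical helper: all n-grams of a word vector for n in n_min..n_max.
clause_ngrams <- function(words, n_min = 1, n_max = 4) {
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) break  # clause too short for longer n-grams
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical helper: split into clauses, build n-grams, filter by dictionary.
extract_sketch <- function(text, dict, n_min = 1, n_max = 4) {
  # Step 1: split the lower-cased text into clauses at , ; ! ? .
  clauses <- trimws(strsplit(tolower(text), "[,;!?.]")[[1]])
  clauses <- clauses[nzchar(clauses)]
  # Step 2: build n-grams within each clause (never across clauses)
  grams <- unlist(lapply(clauses, function(cl) {
    words <- strsplit(cl, "\\s+")[[1]]
    clause_ngrams(words, n_min, n_max)
  }))
  # Step 3: keep only the n-grams found in the dictionary
  unique(grams[grams %in% dict])
}

dict <- c("text mining", "keyword extraction", "network analysis")
text <- "Text mining is useful, and keyword extraction helps. Network, analysis!"
keywords <- extract_sketch(text, dict)
print(keywords)
```

In this sketch "network analysis" is not returned, because the comma places "network" and "analysis" in different clauses and n-grams are built only within a clause.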

This function can take some time if the sample size is large, so it is advisable to time it on a small subset with system.time first. That said, it is implemented with data.table and performs well on large data.
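
One way to run such a timing test is sketched below. The helper `time_on_sample` and the stand-in extractor `toy_extract` are hypothetical and not part of akc; they are used only so that the sketch runs without the package installed.

```r
# Hypothetical helper (not part of akc): time a function `f`
# on the first `n` rows of `dt` and return the elapsed seconds.
time_on_sample <- function(dt, f, n = 100) {
  sample_dt <- head(dt, n)
  timing <- system.time(f(sample_dt))
  timing[["elapsed"]]
}

# Stand-in extractor so the sketch is self-contained.
toy_extract <- function(dt) strsplit(dt$text, "[,;!?.]")

docs <- data.frame(
  id = 1:1000,
  text = rep("a b c, d e. f g", 1000),
  stringsAsFactors = FALSE
)

elapsed <- time_on_sample(docs, toy_extract, n = 100)
print(elapsed)
```

With akc loaded, the same helper could be given a wrapper such as `function(d) keyword_extract(d, id = "id", text = "abstract", dict = my_dict)` in place of `toy_extract`, with `my_dict` built as in the Examples section.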

Value

A data.frame (tibble) with two columns: the document ID and the extracted keyword.

Examples


library(akc)
library(dplyr)

# Build a dictionary from the cleaned keywords
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict() -> my_dict

# Use the tidytext stop word list as stop words
tidytext::stop_words %>%
  pull(word) %>%
  unique() -> my_stopword

# Extract dictionary keywords from the abstracts
bibli_data_table %>%
  keyword_extract(id = "id", text = "abstract",
                  dict = my_dict, stopword = my_stopword)

[Package akc version 0.9.9 Index]