keyword_extract {akc} | R Documentation |
Extract keywords from raw text
Description
When only raw text (such as abstracts or full articles) is available and no keywords are given,
keywords can be extracted from the text first. The minimum input is a data.frame with a document ID column
and a raw text column; a user-defined dictionary should also be provided. The make_dict function
builds such a dictionary from a character vector of vocabulary terms. If no dictionary is provided,
the function returns all n-gram tokens without filtering (not recommended).
Usage
keyword_extract(
dt,
id = "id",
text,
dict = NULL,
stopword = NULL,
n_max = 4,
n_min = 1
)
Arguments
dt |
A data.frame containing at least two columns with document ID and text strings for extraction. |
id |
Quoted characters specifying the column name of document ID. Default uses "id". |
text |
Quoted characters specifying the column name of raw text for extraction. |
dict |
A data.table with two columns, namely "id" and "keyword" (set as key).
This should be exported by make_dict. |
stopword |
A vector containing the stop words to be used. Default uses NULL. |
n_max |
The maximum number of words in an n-gram. This must be an integer greater than or equal to 1. Default uses 4. |
n_min |
The minimum number of words in an n-gram. This must be an integer greater than or equal to 1, and less than or equal to n_max. Default uses 1. |
Details
Keyword extraction in akc proceeds in three steps. First, the raw text is split
into independent clauses at the punctuation marks [,;!?.]. Then the n-grams of each
clause are extracted. Finally, only the n-gram phrases found in the user's dictionary
(created with make_dict) are kept. The range of n can be set with n_min and n_max.
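The three steps can be illustrated with a minimal base R sketch. This is not the akc implementation (which the text says relies on data.table); the helpers clause_ngrams and extract_sketch are hypothetical names, whitespace tokenization is an assumption, and stop word handling is left out.

```r
# Hypothetical helper (not part of akc): all n-grams of one clause
# for n between n_min and n_max, using whitespace tokenization.
clause_ngrams <- function(clause, n_min = 1, n_max = 4) {
  words <- strsplit(trimws(clause), "\\s+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(n_min, n_max)) {
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical helper (not part of akc): split text into clauses,
# build n-grams per clause, and keep only the dictionary phrases.
extract_sketch <- function(text, dict, n_min = 1, n_max = 4) {
  clauses <- strsplit(text, "[,;!?.]")[[1]]
  clauses <- clauses[nzchar(trimws(clauses))]
  grams <- unlist(lapply(clauses, clause_ngrams, n_min = n_min, n_max = n_max))
  unique(grams[grams %in% dict])
}

text <- "text mining is useful. we apply topic models, and network analysis!"
dict <- c("text mining", "topic models", "network analysis", "useful we")

extract_sketch(text, dict, n_min = 1, n_max = 2)
# "text mining" "topic models" "network analysis"
```

Because n-grams are built within each clause, the phrase "useful we" is never produced even though it is in the toy dictionary: the period separates the two words into different clauses.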
This function can take some time if the sample size is large, so it is suggested to time a small test run with system.time first. That said, it is implemented with data.table code and is intended to perform well on large data.
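One way to follow this suggestion, sketched with the bibli_data_table data set and the column names used in the Examples section. The subset size of 100 documents is an arbitrary choice for illustration, and the timing only gives a rough sense of the cost before running on the full data.

```r
library(akc)
library(dplyr)

# Dictionary built from the keyword column, as in the Examples section
my_dict <- bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict()

# Time the extraction on a small subset first (100 rows is arbitrary)
small <- head(bibli_data_table, 100)
timing <- system.time(
  res <- keyword_extract(small, id = "id", text = "abstract", dict = my_dict)
)

# Elapsed wall-clock time in seconds for the subset
timing[["elapsed"]]
```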
Value
A data.frame (tibble) with two columns, namely document ID and extracted keyword.
See Also
Examples
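library(akc)
library(dplyr)

# Build a dictionary from the cleaned keyword column
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword") %>%
  pull(keyword) %>%
  make_dict() -> my_dict

# Use the stop word list from tidytext (the tidytext package must be installed)
tidytext::stop_words %>%
  pull(word) %>%
  unique() -> my_stopword

# Extract dictionary keywords from the abstracts
bibli_data_table %>%
  keyword_extract(id = "id", text = "abstract",
                  dict = my_dict, stopword = my_stopword)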