R: Automatic keyword cleaning and transfer to tidy format

keyword_clean {akc}

R Documentation

Automatic keyword cleaning and transfer to tidy format

Description

Carry out several keyword cleaning processes automatically and return a tidy table with document ID and keywords.

Usage

keyword_clean(
  df,
  id = "id",
  keyword = "keyword",
  sep = ";",
  rmParentheses = TRUE,
  rmNumber = TRUE,
  lemmatize = FALSE,
  lemmatize_dict = NULL
)

Arguments

`df`	A data.frame containing at least two columns: one with document IDs and one with keyword strings in which the keywords are delimited by a separator.
`id`	A quoted character string giving the name of the document ID column. Default is "id".
`keyword`	A quoted character string giving the name of the keyword column. Default is "keyword".
`sep`	Separator(s) of keywords. Default is ";".
`rmParentheses`	Whether to remove the contents of parentheses (including the parentheses themselves). Default is TRUE.
`rmNumber`	Whether to remove keywords that are pure number sequences. Default is TRUE.
`lemmatize`	Whether to lemmatize the keywords. Lemmatization is performed by the 'lemmatize_strings' function in the 'textstem' package. Default is FALSE.
`lemmatize_dict`	A dictionary of base terms and lemmas to use for replacement. Only used when the lemmatize parameter is `TRUE`. The first column should be the full word form in lower case, while the second column is the corresponding replacement lemma. Default is `NULL`, which applies the default dictionary used by the `lemmatize_strings` function.

Details

The full cleaning process consists of the following steps:

1. Split the text at the separators;
2. Remove the contents of parentheses (including the parentheses themselves);
3. Remove white space from the start and end of each string and reduce repeated white space inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization.

Some of these steps can be suppressed or activated through the parameters. The default setting does not use lemmatization; it is suggested to use keyword_merge to merge the keywords afterward.
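Steps 1 to 5 can be illustrated with a minimal base R sketch. This is not the implementation used in akc: the helper name `clean_keywords_sketch` and its arguments are hypothetical, the separator is treated as a literal string, and lemmatization is left out.

```r
# Hypothetical helper for illustration only; NOT the akc implementation.
# It works on a single character vector instead of a data.frame.
clean_keywords_sketch <- function(x, sep = ";",
                                  rm_parentheses = TRUE,
                                  rm_number = TRUE) {
  # 1. Split the text at the separator (assumed here to be a literal string)
  kw <- unlist(strsplit(x, sep, fixed = TRUE))
  # 2. Remove the contents of parentheses, including the parentheses
  if (rm_parentheses) kw <- gsub("\\([^)]*\\)", "", kw)
  # 3. Collapse repeated white space, then trim both ends
  kw <- trimws(gsub("\\s+", " ", kw))
  # 4. Drop missing values, empty strings and (optionally) pure number sequences
  kw <- kw[!is.na(kw) & kw != ""]
  if (rm_number) kw <- kw[!grepl("^[0-9]+$", kw)]
  # 5. Convert all letters to lower case
  tolower(kw)
}

result <- clean_keywords_sketch("Deep Learning (DL);  2020 ; ;Text   Mining")
print(result)
# [1] "deep learning" "text mining"
```

With `rm_number = FALSE`, the keyword "2020" would be kept in the result.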

Value

A tbl with two columns, namely document ID and cleaned keywords.

Examples

library(akc)

bibli_data_table

bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword")
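The `id` and `keyword` arguments matter when the input columns have other names. The following is a small sketch under stated assumptions: the data frame `docs` and its column names `doc_id` and `kw` are made up for illustration, and the call is only executed if the akc package is installed.

```r
# Hypothetical input; the column names `doc_id` and `kw` are invented
# for this illustration and are not part of akc.
docs <- data.frame(
  doc_id = c(1, 2),
  kw = c("Text Mining (TM); Bibliometrics; 2020",
         "text mining;Network Analysis"),
  stringsAsFactors = FALSE
)

# Run the cleaning only when akc is available
if (requireNamespace("akc", quietly = TRUE)) {
  cleaned <- akc::keyword_clean(docs, id = "doc_id", keyword = "kw", sep = ";")
  print(cleaned)
}
```

Setting `rmParentheses = FALSE` or `rmNumber = FALSE` in the same call would keep the parenthesised text or the pure number keyword, respectively.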

[Package akc version 0.9.9 Index]