keyword_clean {akc} | R Documentation |
Automatic keyword cleaning and transfer to tidy format
Description
Carry out several keyword cleaning processes automatically and return a tidy table with document ID and keywords.
Usage
keyword_clean(
df,
id = "id",
keyword = "keyword",
sep = ";",
rmParentheses = TRUE,
rmNumber = TRUE,
lemmatize = FALSE,
lemmatize_dict = NULL
)
Arguments
df |
A data.frame containing at least two columns: a document ID and a keyword string with separators. |
id |
Quoted character string specifying the name of the document ID column. Default uses "id". |
keyword |
Quoted character string specifying the name of the keyword column. Default uses "keyword". |
sep |
Separator(s) of keywords. Default uses ";". |
rmParentheses |
Remove the contents in the parentheses (including the parentheses) or not. Default uses TRUE. |
rmNumber |
Remove pure number sequences or not. Default uses TRUE. |
lemmatize |
Lemmatize the keywords or not. Lemmatization is performed by the 'lemmatize_strings' function in the 'textstem' package. Default uses FALSE. |
lemmatize_dict |
A dictionary of base terms and lemmas to use for replacement.
Only used when the lemmatize parameter is TRUE. |
Details
The entire cleaning process includes:
1. Split the text with separators;
2. Remove the contents in the parentheses (including the parentheses);
3. Remove white spaces from the start and end of each string and reduce repeated white spaces inside a string;
4. Remove all empty strings and pure number sequences;
5. Convert all letters to lower case;
6. Lemmatization.
Some of these steps can be suppressed or activated by adjusting the parameters.
By default, lemmatization is not applied; it is suggested to use keyword_merge to merge the keywords afterward.
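The non-lemmatization steps listed above can be approximated with base R alone. The sketch below is only an illustration of the described sequence, not the akc implementation: the helper name `clean_keywords_sketch` and its arguments `rm_parentheses` and `rm_number` are hypothetical, and lemmatization (step 6) is left out.

```r
# Hypothetical helper mimicking the documented steps with base R only.
# It is NOT the code used inside keyword_clean().
clean_keywords_sketch <- function(x, sep = ";",
                                  rm_parentheses = TRUE,
                                  rm_number = TRUE) {
  # 1. Split the text on the separator (treated as a literal string here)
  kw <- unlist(strsplit(x, sep, fixed = TRUE))
  # 2. Drop parenthesised content together with the parentheses
  if (rm_parentheses) kw <- gsub("\\([^)]*\\)", "", kw)
  # 3. Squeeze repeated white space, then trim both ends
  kw <- trimws(gsub("\\s+", " ", kw))
  # 4. Drop empty strings and, optionally, pure number sequences
  kw <- kw[kw != ""]
  if (rm_number) kw <- kw[!grepl("^[0-9]+$", kw)]
  # 5. Convert to lower case
  tolower(kw)
}

clean_keywords_sketch("Data Mining (DM);  Text   analysis ;2019; ;R")
# "data mining" "text analysis" "r"
```

Setting `rm_parentheses = FALSE` or `rm_number = FALSE` in the sketch skips the corresponding step, which mirrors how the documented `rmParentheses` and `rmNumber` arguments are described.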
Value
A tbl with two columns, namely document ID and cleaned keywords.
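To make the "tidy" shape concrete, the base R sketch below expands a made-up input into one row per document ID and keyword pair. The data and the object names `raw`, `parts` and `tidy` are assumptions for illustration; the real function returns a tbl and also applies the cleaning steps, which are omitted here.

```r
# Illustrative input, not the akc example data.
raw <- data.frame(
  id = c(1, 2),
  keyword = c("a;b", "c"),
  stringsAsFactors = FALSE
)

# Split each keyword string, then repeat each ID once per keyword.
parts <- strsplit(raw$keyword, ";", fixed = TRUE)
tidy <- data.frame(
  id = rep(raw$id, lengths(parts)),
  keyword = unlist(parts),
  stringsAsFactors = FALSE
)
tidy
```

Document 1 contributes two rows and document 2 contributes one, so the result has three rows in total.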
See Also
Examples
library(akc)
bibli_data_table
bibli_data_table %>%
  keyword_clean(id = "id", keyword = "keyword")
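A further sketch shows the documented arguments on input whose columns are not named "id" and "keyword" and whose separator is a comma. The data frame `toy` and its column names `doc` and `tags` are invented for this illustration, and the call is guarded so that it only runs when akc is installed. No output is shown because it has not been verified here.

```r
# Toy input; the column names `doc` and `tags` are made up for illustration.
toy <- data.frame(
  doc  = 1:2,
  tags = c("Machine Learning (ML), 2020, Text Mining", "text mining , R"),
  stringsAsFactors = FALSE
)

# Only run the call when the akc package is available.
if (requireNamespace("akc", quietly = TRUE)) {
  cleaned <- akc::keyword_clean(
    toy,
    id = "doc",        # column holding the document ID
    keyword = "tags",  # column holding the keyword strings
    sep = ","          # keywords are comma-separated in this toy data
  )
  print(cleaned)
}
```

With the defaults left in place, the description above suggests that "(ML)" and "2020" would be dropped and the remaining keywords lower-cased.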