rm_frequent_words {textTools} | R Documentation
Delete rows in a text.table where the number of identical records within a group is more than a certain threshold
Description
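Deletes rows from a text.table when the number of identical records within a group exceeds a threshold. The threshold is given by max_count, either as a raw count or, if max_count_is_ratio is TRUE, as a ratio of the total given in total_count_col.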
Usage
rm_frequent_words(
x,
text,
count_col_name = NULL,
group_by = c(),
max_count,
max_count_is_ratio = FALSE,
total_count_col = NULL
)
Arguments
x | A text.table created by as.text.table().
text | A string, the name of the column in x whose term frequency determines which rows are deleted.
count_col_name | A string, the name to assign to the new column containing the count of each word. If NULL, the counts are not returned.
group_by | A vector of column names to group by. List columns are not supported as grouping columns.
max_count | A number, the maximum number of times a word can occur and still be kept.
max_count_is_ratio | TRUE/FALSE. If TRUE, the value passed to max_count is treated as a ratio rather than a raw count.
total_count_col | A string, the name of the column containing the denominator (likely the total count of records within a group) used to calculate the ratio of a word's count to the total when max_count_is_ratio is TRUE.
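The interaction between max_count, max_count_is_ratio, and the total count column can be illustrated with a small base R sketch. This is not the package's implementation. The data frame `words`, its columns `doc`, `word`, `count`, and `total`, and the variable `max_ratio` are hypothetical names chosen for illustration, and the use of `<=` as the keep condition is an assumption based on the description that rows occurring more than the threshold are deleted.

```r
# Hypothetical stand-in for a tokenized table: one row per word, plus a group column.
words <- data.frame(
  doc  = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("the", "dog", "the", "nice", "the", "dog", "dog"),
  stringsAsFactors = FALSE
)

# Count identical words within each group (the role played by group_by).
words$count <- ave(seq_along(words$word), words$doc, words$word, FUN = length)

# Total records per group, the kind of denominator a total count column would hold.
words$total <- ave(seq_along(words$word), words$doc, FUN = length)

# Raw-count rule (assumed): keep rows whose word occurs at most max_count times.
max_count <- 1
kept_by_count <- words[words$count <= max_count, ]

# Ratio rule (assumed): keep rows whose share of the group total is at most max_ratio.
max_ratio <- 0.5
kept_by_ratio <- words[words$count / words$total <= max_ratio, ]
```

Under these assumptions the raw-count rule keeps three rows ("dog" and "nice" in group a, "the" in group b), while the ratio rule drops only the two "dog" rows in group b, where the word makes up two thirds of the group.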
Value
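A text.table with the rows whose count exceeds the max_count threshold deleted. If count_col_name is not NULL, the result also contains a column of that name holding the count of each word.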
Examples
library(data.table)  # provides as.data.table()
library(textTools)

# Split each sentence into one word per row, then drop words occurring more than once
rm_frequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c(
          "a",
          "b"
        ),
        col2 = c(
          tolower("The dog is nice because it picked up the newspaper."),
          tolower("The dog is extremely nice because it does the dishes.")
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  max_count = 1
)
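To see which words a call like the one above would be expected to keep, the same counting rule can be worked through in base R without the package. This is a sketch under stated assumptions, not the package's code: it assumes that with no group_by columns the counts are taken across the whole table, and that a word is kept when its count is at most max_count. The names `sentences`, `tokens`, `words`, and `kept` are hypothetical.

```r
# The two example sentences, named after their col1 values.
sentences <- c(
  a = tolower("The dog is nice because it picked up the newspaper."),
  b = tolower("The dog is extremely nice because it does the dishes.")
)

# Split on single spaces, mirroring split = " " in the example.
tokens <- strsplit(sentences, " ", fixed = TRUE)

# One row per word, carrying the source sentence label along.
words <- data.frame(
  col1 = rep(names(tokens), lengths(tokens)),
  col2 = unlist(tokens, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Count each word across the whole table (assumed behaviour without grouping).
words$count <- ave(seq_along(words$col2), words$col2, FUN = length)

# Keep only words that occur at most once (max_count = 1).
kept <- words[words$count <= 1, ]
print(kept$col2)
```

Because punctuation is not stripped before splitting, "newspaper." and "dishes." are counted as distinct tokens with their trailing periods and survive the filter along with "picked", "up", "extremely", and "does".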