rm_infrequent_words {textTools} | R Documentation |
Delete rows in a text.table where the number of identical records within a group is less than a certain threshold
Description
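Deletes rows from a text.table whose word occurs fewer than min_count times. Counts are taken over identical records, optionally within the groups defined by group_by, and the threshold can be given either as an absolute count or as a ratio of a total.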
Usage
rm_infrequent_words(
x,
text,
count_col_name = NULL,
group_by = c(),
min_count,
min_count_is_ratio = FALSE,
total_count_col = NULL
)
Arguments
x |
A text.table created by as.text.table(). |
text |
A string, the name of the column in x containing the words whose frequency determines which rows are deleted. |
count_col_name |
A string, the name to assign to the new column containing the count of each word. If NULL, the counts are not returned. |
group_by |
A vector of column names to group by. List columns are not supported as grouping columns. |
min_count |
A number, the minimum number of times a word must occur for its rows to be kept. |
min_count_is_ratio |
TRUE/FALSE. If TRUE, the value passed to min_count is treated as a ratio rather than an absolute count. |
total_count_col |
A string, the name of the column containing the denominator (likely the total count of records within a group) used to calculate the ratio of a word's count to the total when min_count_is_ratio is TRUE. |
Value
A text.table with the rows deleted whose word count is below min_count. If count_col_name is not NULL, the counts are included in a column of that name.
Examples
library(data.table) # provides as.data.table()
library(textTools)

# No grouping: keep only words that occur at least 4 times in the whole table.
rm_infrequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower("The dog is nice because it picked up the newspaper."),
          tolower("The dog is extremely nice because it does the dishes.")
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  min_count = 4
)

# Grouped by col1: keep only words that occur at least twice within their group.
# Each sentence is built with paste() so that no line break ends up inside a token.
rm_infrequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower(paste(
            "The dog is nice because it picked up the",
            "newspaper and it is the nice kind of dog."
          )),
          tolower(paste(
            "The dog is extremely nice because it does the dishes",
            "and it is cool."
          ))
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  group_by = "col1",
  min_count = 2
)
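
The examples above do not exercise min_count_is_ratio. The sketch below illustrates the idea behind the two kinds of threshold using plain data.table operations rather than rm_infrequent_words() itself. It is not the package's implementation: the table dt, its columns doc, word, count and total, and the use of >= for the comparison are all assumptions made for illustration.

```r
library(data.table)

# Hypothetical long-format table with one row per word,
# similar in spirit to a tokenized text.table.
dt <- data.table(
  doc  = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("the", "dog", "the", "nice", "the", "dog", "dog")
)

# Number of identical words within each group (the role of count_col_name).
dt[, count := .N, by = .(doc, word)]

# Total number of records per group (the role of total_count_col).
dt[, total := .N, by = doc]

# Absolute threshold: keep words that occur at least twice in their group.
kept_abs <- dt[count >= 2]

# Ratio threshold: keep words that make up at least 60% of their group.
# The >= comparison is an assumption, not confirmed package behavior.
kept_ratio <- dt[count / total >= 0.6]

print(kept_abs)
print(kept_ratio)
```

In this toy table "the" occurs twice among the four rows of group "a" (ratio 0.5) and "dog" occurs twice among the three rows of group "b" (ratio about 0.67), so the absolute threshold keeps both words while the ratio threshold keeps only "dog" in group "b".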