rm_frequent_words {textTools}

R Documentation

Delete rows in a text.table where the number of identical records within a group is more than a certain threshold

Description

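Deletes rows from a text.table where the number of identical records within a group exceeds a threshold. The threshold is set by max_count and is read either as a count or, if max_count_is_ratio is TRUE, as a ratio.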

Usage

rm_frequent_words(
  x,
  text,
  count_col_name = NULL,
  group_by = c(),
  max_count,
  max_count_is_ratio = FALSE,
  total_count_col = NULL
)

Arguments

`x`	A text.table created by as.text.table().
`text`	A string, the name of the column in x whose term frequencies determine which rows are deleted.
`count_col_name`	A string, the name to assign to the new column containing the count of each word. If NULL, does not return the counts.
`group_by`	A vector of column names to group by. Grouping does not work on list columns.
`max_count`	A number, the maximum number of times a word can occur and still be kept.
`max_count_is_ratio`	TRUE/FALSE. If TRUE, the value passed to max_count is treated as a ratio rather than a count.
`total_count_col`	The name of the column containing the denominator (likely the total count of records within a group) used to calculate the ratio of a word's count to the total when max_count_is_ratio is TRUE.
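
The difference between a count threshold and a ratio threshold can be sketched in base R. This is an illustration of the idea only, not the package's implementation: the data frame, its column names (doc, word, count, total), and the max_ratio value are all hypothetical, and treating a word exactly at the threshold as kept is an assumption drawn from the "more than" wording in the description.

```r
# Hypothetical long-format data: one row per word within a document.
# All column names here are illustrative assumptions.
words <- data.frame(
  doc  = c("a", "a", "a", "a", "b", "b", "b"),
  word = c("the", "dog", "the", "nice", "the", "dog", "dishes"),
  stringsAsFactors = FALSE
)

# Count identical records of `word` within each `doc` group.
words$count <- ave(seq_along(words$word), words$doc, words$word, FUN = length)

# Total records per group: the kind of denominator total_count_col would name.
words$total <- ave(seq_along(words$word), words$doc, FUN = length)

# Count threshold: keep words that occur at most max_count times in their group.
max_count <- 1
kept_by_count <- words[words$count <= max_count, ]

# Ratio threshold: keep words whose share of the group total is at most max_ratio.
max_ratio <- 0.3
kept_by_ratio <- words[words$count / words$total <= max_ratio, ]
```

Under the count threshold, only "the" in document a (two occurrences) is dropped. Under the ratio threshold, every word in document b is also dropped, because each makes up one third of that group.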

Value

A text.table with the rows whose duplicate count exceeds the threshold deleted.

Examples

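# Attach the packages the example relies on.
library(data.table)
library(textTools)

# Split col2 into words, then delete words that occur more than once.
# The per-word counts are returned in a column named "count".
rm_frequent_words(
  as.text.table(
    x = as.data.table(
      list(
        col1 = c("a", "b"),
        col2 = c(
          tolower("The dog is nice because it picked up the newspaper."),
          tolower("The dog is extremely nice because it does the dishes.")
        )
      )
    ),
    text = "col2",
    split = " "
  ),
  text = "col2",
  count_col_name = "count",
  max_count = 1
)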

[Package textTools version 0.1.0 Index]