docfreq_filter {corpustools} R Documentation

Support function for subset method

Description

Support function that enables subsetting by the document frequency of a given feature, i.e. the number of documents in which the feature occurs. It should only be used within the tCorpus subset method, or within any tCorpus method that supports a subset argument.
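
To make the idea of a document frequency filter concrete, here is a minimal base R sketch. It does not call corpustools and is not the package's implementation; the toy `tokens` data frame and the variable names `pairs`, `docfreq` and `keep` are assumptions made for illustration only.

```r
# Toy token table: one row per token, with a doc_id and a token column.
tokens <- data.frame(
  doc_id = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
  token  = c("a", "a", "a", "b", "b", "a", "a", "c", "c"),
  stringsAsFactors = FALSE
)

# Document frequency: count each (doc_id, token) pair once, then tally per token.
pairs <- unique(tokens[, c("doc_id", "token")])
docfreq <- table(pairs$token)

# Illustrative stand-in for min = 2: TRUE for tokens found in at least 2 documents.
keep <- as.vector(docfreq[tokens$token]) >= 2
tokens[keep, ]
```

In this sketch only the rows for "a" are kept, because "a" is the only token that occurs in both documents.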

Usage

docfreq_filter(
  x,
  min = -Inf,
  max = Inf,
  top = NULL,
  bottom = NULL,
  doc_id = parent.frame()$doc_id
)

Arguments

x

The name of the feature column. Can be given as a call or a string.

min

A number setting the minimum document frequency. Defaults to -Inf (no lower bound).

max

A number setting the maximum document frequency. Defaults to Inf (no upper bound).

top

A number. If given, only that many features with the highest document frequency are TRUE.

bottom

A number. If given, only that many features with the lowest document frequency are TRUE.

doc_id

Included for reference, but should not be set manually. The doc_id is taken automatically from the tCorpus when docfreq_filter is used within the subset method.
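
The top and bottom arguments select features by rank rather than by a threshold. The base R sketch below illustrates that idea under stated assumptions: the `docfreq` values are made up, the names `top_features` and `bottom_features` are hypothetical, and the way corpustools itself handles ties is not covered here.

```r
# Made-up document frequencies for four features (no ties).
docfreq <- c(a = 5, b = 3, c = 2, d = 1)

# Sketch of top = 2: the two features with the highest document frequency.
top_features <- names(sort(docfreq, decreasing = TRUE))[1:2]

# Sketch of bottom = 2: the two features with the lowest document frequency.
bottom_features <- names(sort(docfreq))[1:2]

# Logical vectors over a token column, as a subset expression would need.
token <- c("a", "d", "b", "a", "c")
in_top <- token %in% top_features
in_bottom <- token %in% bottom_features
```

Here `in_top` marks the tokens "a" and "b", while `in_bottom` marks "d" and "c".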

Examples

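library(corpustools)

tc = create_tcorpus(c('a a a b b', 'a a c c'))

tc$tokens  # tokens before subsetting
# keep only tokens that occur in at least 2 documents
tc$subset(subset = docfreq_filter(token, min=2))
tc$tokens  # tokens after subsetting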

[Package corpustools version 0.4.10 Index]