tCorpus$feature_subset {corpustools}    R Documentation

Filter features

Description

Similar to tCorpus$subset, but instead of deleting rows it only sets the values of a specified feature to NA. This is convenient because it allows only a selection of features to be used in an analysis (e.g. a topic model) while maintaining the context of the full article, so that results can be viewed in that context (e.g. in a topic browser).

Just as in subset, it is easy to use objects and functions in the filter, including the special functions for using term frequency statistics (see documentation for tCorpus$subset).
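
The difference between deleting rows and masking a feature can be illustrated without corpustools. The following is a minimal base R sketch: the data frame and its column names (doc_id, token_id, token, token_subset) are made up for illustration and are not claimed to match the internal tokens table of a tCorpus.

```r
# Toy tokens table; column names are illustrative assumptions.
tokens <- data.frame(
  doc_id   = c("d1", "d1", "d1", "d2", "d2"),
  token_id = c(1, 2, 3, 1, 2),
  token    = c("the", "cat", "sat", "a", "dog"),
  stringsAsFactors = FALSE
)

# Logical filter: TRUE for the rows whose feature should be kept.
keep <- tokens$token %in% c("cat", "dog")

# Subsetting deletes the non-matching rows, so the surrounding text is lost.
deleted <- tokens[keep, ]

# Masking keeps every row and only blanks the filtered feature in a new column.
masked <- tokens
masked$token_subset <- ifelse(keep, masked$token, NA_character_)

print(deleted)
print(masked)
```

In the masked table the original token column is untouched, so the full text can still be reconstructed, while an analysis that reads only token_subset sees just the selected features.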

Usage

## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).

feature_subset(column, new_column, subset)

Arguments

column

the column containing the feature to be used as the input

subset

logical expression indicating the rows to keep in the tokens data, i.e. rows for which the logical expression is FALSE will be set to NA.

new_column

the column in which to save the filtered feature. Can be a new column or overwrite an existing one.

min_freq

an integer, specifying minimum token frequency.

min_docfreq

an integer, specifying minimum document frequency.

max_freq

an integer, specifying maximum token frequency.

max_docfreq

an integer, specifying maximum document frequency.

min_char

an integer, specifying minimum characters in a token

max_char

an integer, specifying maximum characters in a token

Examples

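## create a small tCorpus from a single text
tc = create_tcorpus('a a a a b b b c c')

## keep the token only for rows where token_id < 5; other rows are set to NA
tc$feature_subset('token', 'tokens_subset1', subset = token_id < 5)
## keep the token only for rows that pass the term frequency filter
tc$feature_subset('token', 'tokens_subset2', subset = freq_filter(token, min = 3))

## inspect the tokens data, including the two new columns
tc$tokens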
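
As a rough guide to what the two filters in this example select, the same logic can be imitated in base R. This is only a sketch under two assumptions that the page does not state: that token_id counts positions starting at 1, and that the min bound of freq_filter is inclusive (frequency >= min). It does not call corpustools, and the actual tc$tokens output may differ in layout.

```r
# Same text as in the example, split into tokens.
token <- strsplit("a a a a b b b c c", " ")[[1]]

# Assumption: token positions are numbered from 1.
token_id <- seq_along(token)

# Imitates subset = token_id < 5: keep the first four tokens, mask the rest.
subset1 <- ifelse(token_id < 5, token, NA_character_)

# Term frequency of each token, aligned with the token vector.
freq <- ave(seq_along(token), token, FUN = length)

# Imitates a minimum-frequency filter of 3 (assumed inclusive).
subset2 <- ifelse(freq >= 3, token, NA_character_)

print(data.frame(token_id, token, subset1, subset2))
```

Under these assumptions the first filter keeps only the four "a" tokens, and the second keeps "a" (frequency 4) and "b" (frequency 3) while masking "c" (frequency 2).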

[Package corpustools version 0.5.1 Index]