tCorpus$feature_subset {corpustools} | R Documentation |
Filter features
Description
Similar to tCorpus$subset, but instead of deleting rows it only sets the value of the specified feature to NA in the rows that do not match. This is convenient because an analysis (e.g. a topic model) can use only a selection of features while the full article is preserved, so that results can still be viewed in their original context (e.g. in a topic browser).
Just as in subset, it is easy to use objects and functions in the filter, including the special functions for using term frequency statistics (see documentation for tCorpus$subset).
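The difference between deleting rows and masking a feature can be sketched in base R. This is only an illustration of the behaviour described above, not the corpustools implementation; the `tokens` data frame and the `token_subset` column name are hypothetical stand-ins for the tokens data of a tCorpus.

```r
# Hypothetical stand-in for the tokens data of a tCorpus.
tokens <- data.frame(
  token_id = 1:9,
  token = c("a", "a", "a", "a", "b", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Logical expression selecting the rows to keep.
keep <- tokens$token_id < 5

# Subsetting deletes the rows for which the expression is FALSE.
dropped <- tokens[keep, ]

# Masking keeps every row and only sets the feature to NA.
masked <- tokens
masked$token_subset <- ifelse(keep, masked$token, NA_character_)

nrow(dropped)  # 4 rows remain
nrow(masked)   # all 9 rows remain
```

The original `token` column is untouched in `masked`, which is what allows the full text to be shown alongside the filtered feature.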
Usage
## R6 method for class tCorpus. Use as tc$method (where tc is a tCorpus object).
feature_subset(column, new_column, subset)
Arguments
column |
the column containing the feature to be used as the input |
subset |
logical expression indicating the rows to keep in the tokens data, i.e. the feature is set to NA in rows for which the expression is FALSE. |
new_column |
the column to save the filtered feature. Can be a new column or overwrite an existing one. |
min_freq |
an integer, specifying minimum token frequency. |
min_docfreq |
an integer, specifying minimum document frequency. |
max_freq |
an integer, specifying maximum token frequency. |
max_docfreq |
an integer, specifying maximum document frequency. |
min_char |
an integer, specifying the minimum number of characters in a token |
max_char |
an integer, specifying the maximum number of characters in a token |
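To make the distinction between token frequency, document frequency and character length concrete, here is a minimal base R sketch. It assumes that the thresholds are inclusive and that all filters are combined with a logical AND; this page does not state either, so both are assumptions. The `tokens` data frame and the threshold variables are hypothetical and only mirror the argument names.

```r
# Hypothetical tokens data with three documents.
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  token  = c("a", "a", "bb", "a", "bb", "bb", "a", "ccc", "ccc"),
  stringsAsFactors = FALSE
)

# Token frequency: number of occurrences of each token in the corpus.
freq <- as.integer(table(tokens$token)[tokens$token])

# Document frequency: number of distinct documents containing the token.
docs_per_token <- tapply(tokens$doc_id, tokens$token,
                         function(d) length(unique(d)))
docfreq <- as.integer(docs_per_token[tokens$token])

# Illustrative thresholds (assumed inclusive).
min_freq    <- 3
min_docfreq <- 3
max_char    <- 2

keep <- freq >= min_freq &
  docfreq >= min_docfreq &
  nchar(tokens$token) <= max_char

# Mask the feature instead of dropping rows.
tokens$token_filtered <- ifelse(keep, tokens$token, NA_character_)
tokens
```

In this sketch "bb" passes the token frequency and character filters but occurs in only two documents, so only "a" survives all three conditions.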
Examples
tc = create_tcorpus('a a a a b b b c c')
tc$feature_subset('token', 'tokens_subset1', subset = token_id < 5)
tc$feature_subset('token', 'tokens_subset2', subset = freq_filter(token, min = 3))
tc$tokens