textstat_frequency {quanteda.textstats}    R Documentation
Tabulate feature frequencies
Description
Produces summaries of the counts and document frequencies of the features in a dfm, optionally grouped by a docvars variable or other supplied grouping variable.
Usage
textstat_frequency(
  x,
  n = NULL,
  groups = NULL,
  ties_method = c("min", "average", "first", "random", "max", "dense"),
  ...
)
Arguments
x
a dfm object
n
(optional) integer specifying the top n features to return
groups
grouping variable for sampling, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This also changes previous behaviours for
ties_method
character string specifying how ties are treated. See
...
additional arguments passed to
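The effect of the tie-handling options can be sketched in base R. This is an illustration only, not the package's implementation: it assumes that the methods sharing a name with `ties.method` in `base::rank()` behave the same way here, and that "dense" means consecutive ranks with no gaps after ties. The counts in `freq` are hypothetical.

```r
# Hypothetical feature counts in which "b" and "e" tie at 2
freq <- c(a = 6, d = 4, b = 2, e = 2, c = 1)

# Rank on the negated counts so that rank 1 is the most frequent feature
ranks_min   <- rank(-freq, ties.method = "min")      # tied features share the lowest rank
ranks_avg   <- rank(-freq, ties.method = "average")  # tied features share the mean rank
ranks_first <- rank(-freq, ties.method = "first")    # ties broken by order of appearance
ranks_max   <- rank(-freq, ties.method = "max")      # tied features share the highest rank

# Assumption: "dense" ranking leaves no gap after a tie.
# base::rank() has no such option, so it is emulated with match().
ranks_dense <- match(-freq, sort(unique(-freq)))

data.frame(
  feature = names(freq),
  min     = unname(ranks_min),
  average = unname(ranks_avg),
  first   = unname(ranks_first),
  max     = unname(ranks_max),
  dense   = ranks_dense
)
```

With "min", the feature after a tie skips a rank (1, 2, 3, 3, 5), whereas the dense variant continues with the next integer (1, 2, 3, 3, 4).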
Value
a data.frame containing the following variables:
feature
(character) the feature
frequency
count of the feature
rank
rank of the feature, where 1 indicates the greatest frequency
docfreq
document frequency of the feature, as a count (the number of documents in which this feature occurred at least once)
group
(only if groups is specified) the label of the group. If the features have been grouped, then all counts, ranks, and document frequencies are within group. If groups is not specified, the group column is omitted from the returned data.frame.
textstat_frequency returns a data.frame of features and their term and document frequencies within groups.
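To make the meaning of the returned columns concrete, the following base-R sketch derives feature, frequency, rank, and docfreq from a small count matrix, first overall and then within groups. It is not how the package computes its result: `tabulate_features()` is a hypothetical helper, the matrix mirrors the three toy documents used in the Examples, the "min" tie rule is applied via `base::rank()`, and the sketch keeps zero-count features within a group, which the real function may treat differently.

```r
# Counts for the documents "a a b b c d", "a d d d", "a a a"
counts <- matrix(
  c(2, 2, 1, 1,
    1, 0, 0, 3,
    3, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("text1", "text2", "text3"), c("a", "b", "c", "d"))
)

# Hypothetical helper, not part of quanteda.textstats
tabulate_features <- function(m) {
  frequency <- colSums(m)       # total count of each feature
  docfreq   <- colSums(m > 0)   # number of documents containing each feature
  ord <- order(-frequency)      # most frequent feature first
  data.frame(
    feature   = colnames(m)[ord],
    frequency = unname(frequency[ord]),
    rank      = unname(rank(-frequency[ord], ties.method = "min")),
    docfreq   = unname(docfreq[ord])
  )
}

# Ungrouped summary over all documents
overall <- tabulate_features(counts)

# Grouped summary: counts, ranks, and document frequencies are
# computed separately within each group of documents
groups <- c("one", "two", "one")
by_group <- lapply(split(seq_len(nrow(counts)), groups), function(rows) {
  tabulate_features(counts[rows, , drop = FALSE])
})

overall
by_group
```

In the ungrouped summary, "a" has frequency 6 and document frequency 3, while "d" has frequency 4 but appears in only two documents.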
Examples
library("quanteda")
library("quanteda.textstats")
set.seed(20)
dfmat1 <- dfm(tokens(c("a a b b c d", "a d d d", "a a a")))
textstat_frequency(dfmat1)
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "first")
textstat_frequency(dfmat1, groups = c("one", "two", "one"), ties_method = "average")
dfmat2 <- corpus_subset(data_corpus_inaugural, President == "Obama") %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
tstat1 <- textstat_frequency(dfmat2)
head(tstat1, 10)
dfmat3 <- head(data_corpus_inaugural) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm()
textstat_frequency(dfmat3, n = 2, groups = President)
## Not run:
# plot 20 most frequent words
library("ggplot2")
ggplot(tstat1[1:20, ], aes(x = reorder(feature, frequency), y = frequency)) +
    geom_point() +
    coord_flip() +
    labs(x = NULL, y = "Frequency")
# plot relative frequencies by group
dfmat3 <- data_corpus_inaugural %>%
    corpus_subset(Year > 2000) %>%
    tokens(remove_punct = TRUE) %>%
    tokens_remove(stopwords("en")) %>%
    dfm() %>%
    dfm_group(groups = President) %>%
    dfm_weight(scheme = "prop")
# calculate relative frequency by president
tstat2 <- textstat_frequency(dfmat3, n = 10, groups = President)
# plot frequencies
ggplot(data = tstat2, aes(x = factor(nrow(tstat2):1), y = frequency)) +
    geom_point() +
    facet_wrap(~ group, scales = "free") +
    coord_flip() +
    scale_x_discrete(breaks = nrow(tstat2):1,
                     labels = tstat2$feature) +
    labs(x = NULL, y = "Relative frequency")
## End(Not run)