textmodel_wordmap {wordmap} | R Documentation |
A model for multinomial feature extraction and document classification
Description
Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.
Usage
textmodel_wordmap(
x,
y,
label = c("all", "max"),
smooth = 1,
boolean = FALSE,
drop_label = TRUE,
verbose = quanteda_options("verbose"),
entropy = c("none", "global", "local", "average"),
...
)
Arguments
x |
a dfm or fcm created by |
y |
a dfm or a sparse matrix that records the class membership of the
documents. It can be created by applying |
label |
if "max", uses only labels for the maximum value in each row of
|
smooth |
a value added to the frequency of words to smooth likelihood ratios. |
boolean |
if |
drop_label |
if |
verbose |
if |
entropy |
[experimental] the scheme used to compute the entropy that
regularizes likelihood ratios. The entropy of features is computed over
labels if |
... |
additional arguments passed to internal functions. |
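The effect of label = "max" and boolean = TRUE can be pictured on toy matrices. This is a minimal base R sketch, not the package's internal code: the matrices x and y below are hypothetical, it assumes that "each row" refers to the rows of the label matrix, and it assumes that the Boolean transformation means recording only the presence or absence of each feature.

```r
# Hypothetical label matrix: rows are documents, columns are labels.
y <- matrix(
  c(2, 1, 0,
    0, 3, 1,
    0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("doc", 1:3), c("econ", "security", "health"))
)

# Keep only the entries equal to the row maximum (assumed reading of
# label = "max"); rows without any label stay empty.
row_max <- apply(y, 1, max)
y_max <- (y == row_max & y > 0) * 1

# Hypothetical feature matrix: rows are documents, columns are features.
x <- matrix(
  c(5, 0, 1, 2, 0, 7),
  nrow = 3,
  dimnames = list(paste0("doc", 1:3), c("trade", "army"))
)

# Assumed reading of the Boolean transformation: presence or absence only.
x_bool <- (x > 0) * 1
```

In this sketch a document with several labels keeps only its strongest one, and a feature repeated many times in one document counts once.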
Details
Wordmap learns associations between words and classes as likelihood
ratios based on the features in x and the labels in y. Large likelihood
ratios tend to concentrate in a small number of features, but the
entropy of their frequencies over labels or documents helps to disperse the
distribution.
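The two quantities described above can be sketched in base R on a toy table of counts. This is an illustration under stated assumptions, not the package's implementation: the count matrix is hypothetical, log_lr() and feature_entropy() are helper names invented here, and the formula (a smoothed log ratio of a feature's relative frequency under one label against all other labels) is one common way to write a likelihood ratio. How the entropy is then combined with the ratios is not shown.

```r
# Hypothetical counts: rows are labels, columns are features.
counts <- matrix(
  c(8, 1, 1,
    1, 6, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("econ", "security"), c("trade", "army", "state"))
)
smooth <- 1  # plays the role of the smooth argument

# Smoothed log likelihood ratio of each feature for each label,
# compared with all other labels pooled together (illustrative formula).
log_lr <- function(counts, smooth) {
  out <- counts * 0
  for (k in seq_len(nrow(counts))) {
    inside <- counts[k, ] + smooth
    outside <- colSums(counts[-k, , drop = FALSE]) + smooth
    out[k, ] <- log(inside / sum(inside)) - log(outside / sum(outside))
  }
  out
}
lr <- log_lr(counts, smooth)

# Shannon entropy (base 2) of each feature's distribution over labels.
feature_entropy <- function(counts) {
  p <- sweep(counts, 2, colSums(counts), "/")
  apply(p, 2, function(v) -sum(ifelse(v > 0, v * log2(v), 0)))
}
ent <- feature_entropy(counts)
```

In this toy table "trade" is strongly tied to one label, so it has a large ratio and low entropy, while "state" is spread more evenly and has higher entropy.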
Value
Returns a fitted textmodel_wordmap object with the following elements:
model |
a matrix that records the association between classes and features. |
data |
the original input of |
feature |
the feature set in the model. |
concatenator |
the concatenator in |
entropy |
the type of entropy weights used. |
boolean |
the use of the Boolean transformation of |
call |
the command used to execute the function. |
version |
the version of the wordmap package. |
References
Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.
Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.
Examples
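library(quanteda)
library(wordmap) # provides data_corpus_ungd2017 and data_dictionary_topic
# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017, to = "sentences")
# tokenize and remove stopwords
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))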
# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)
# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)
# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)