textmodel_wordmap {wordmap}R Documentation

A model for multinomial feature extraction and document classification

Description

Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.

Usage

textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

x

a dfm or fcm created by quanteda::dfm()

y

a dfm or a sparse matrix that record class membership of the documents. It can be created applying quanteda::dfm_lookup() to x.

label

if "max", uses only labels for the maximum value in each row of y.

smooth

a value added to the frequency of words to smooth likelihood ratios.

boolean

if TRUE, only consider presence or absence of features in each document to limit the impact of words repeated in few documents.

drop_label

if TRUE, drops empty columns of y and ignore their labels.

verbose

if TRUE, shows progress of training.

entropy

[experimental] the scheme to compute the entropy to regularize likelihood ratios. The entropy of features are computed over labels if global or over documents with the same labels if local. Local entropy is averaged if average. See the details.

...

additional arguments passed to internal functions.

Details

Wordmap learns association between words and classes as likelihood ratios based on the features in x and the labels in y. The large likelihood ratios tend to concentrate to a small number of features but the entropy of their frequencies over labels or documents helps to disperse the distribution.

Value

Returns a fitted textmodel_wordmap object with the following elements:

model

a matrix that records the association between classes and features.

data

the original input of x.

feature

the feature set in the model.

concatenator

the concatenator in x.

entropy

the type of entropy weights used.

boolean

the use of the Boolean transformation of x.

call

the command used to execute the function.

version

the version of the wordmap package.

References

Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi.org/10.1080/21670811.2017.1293487, Digital Journalism.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Examples

require(quanteda)

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)

# tokenize
toks <- tokens(corp, remove_punct = TRUE) %>%
   tokens_remove(stopwords("en"))

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)


[Package wordmap version 0.8.0 Index]