textmodel_wordmap {wordmap}

R Documentation

A model for multinomial feature extraction and document classification

Description

Wordmap is a model for multinomial feature extraction and document classification. Its naive Bayesian algorithm allows users to train the model on a large corpus with noisy labels given by document meta-data or keyword matching.

Usage

textmodel_wordmap(
  x,
  y,
  label = c("all", "max"),
  smooth = 1,
  boolean = FALSE,
  drop_label = TRUE,
  verbose = quanteda_options("verbose"),
  entropy = c("none", "global", "local", "average"),
  ...
)

Arguments

`x`	a dfm created by `quanteda::dfm()` or an fcm created by `quanteda::fcm()`.
`y`	a dfm or a sparse matrix that records the class membership of the documents. It can be created by applying `quanteda::dfm_lookup()` to `x`.
`label`	if "max", uses only labels for the maximum value in each row of `y`.
`smooth`	a value added to the frequency of words to smooth likelihood ratios.
`boolean`	if `TRUE`, considers only the presence or absence of features in each document, to limit the impact of words repeated in a few documents.
`drop_label`	if `TRUE`, drops empty columns of `y` and ignores their labels.
`verbose`	if `TRUE`, shows progress of training.
`entropy`	[experimental] the scheme used to compute the entropy that regularizes likelihood ratios. The entropy of features is computed over labels if "global", or over documents with the same labels if "local". The local entropy is averaged if "average". See Details.
`...`	additional arguments passed to internal functions.
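
As a minimal sketch of how these arguments fit together, the following builds `y` by applying `quanteda::dfm_lookup()` to `x` and then fits a model with non-default settings. It assumes that `data_corpus_ungd2017` and `data_dictionary_topic` are made available by loading the wordmap package. The object names `dfmt_feat`, `dfmt_dict` and `map_max` are chosen for this illustration only.

```r
library(quanteda)
library(wordmap)  # assumed to provide data_corpus_ungd2017 and data_dictionary_topic

# x: document-feature matrix of sentences
toks <- tokens(corpus_reshape(data_corpus_ungd2017), remove_punct = TRUE)
dfmt_feat <- dfm(toks)

# y: class membership obtained by applying the seed dictionary to x
dfmt_dict <- dfm_lookup(dfmt_feat, data_dictionary_topic)

map_max <- textmodel_wordmap(
  dfmt_feat, dfmt_dict,
  label = "max",    # use only the label with the maximum value in each row of y
  boolean = TRUE,   # consider presence or absence of features only
  smooth = 1        # value added to word frequencies
)
```

Because `dfm_lookup()` works on single features of `x`, multi-word dictionary entries are better matched at the token level with `quanteda::tokens_lookup()`, as in the Examples section.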

Details

Wordmap learns the association between words and classes as likelihood ratios, based on the features in `x` and the labels in `y`. Large likelihood ratios tend to concentrate in a small number of features, but weighting by the entropy of feature frequencies over labels or documents helps to disperse the distribution.
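
The idea of a smoothed likelihood ratio can be illustrated in base R. This is a rough sketch under stated assumptions, not the package's internal code: the toy matrices `x` and `y`, the helper `llr_sketch()` and the exact form of the smoothing are all hypothetical, and the entropy weighting is left out. It compares each feature's relative frequency in documents that carry a label with its relative frequency in the remaining documents.

```r
# Toy document-feature counts (4 documents x 3 features); values are made up
x <- matrix(c(2, 0, 1,
              3, 1, 0,
              0, 4, 1,
              0, 2, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("doc", 1:4), c("tax", "war", "people")))

# Toy class membership (4 documents x 2 classes); 1 means the document has the label
y <- matrix(c(1, 0,
              1, 0,
              0, 1,
              0, 1),
            nrow = 4, byrow = TRUE,
            dimnames = list(rownames(x), c("economy", "security")))

# Hypothetical helper: smoothed log-likelihood ratio for each class and feature
llr_sketch <- function(x, y, smooth = 1) {
  in_class  <- t(y) %*% x      # feature counts in documents with each label
  out_class <- t(1 - y) %*% x  # feature counts in all other documents
  # smoothed relative frequencies; each row is divided by its own total
  p_in  <- (in_class + smooth) / (rowSums(in_class) + smooth * ncol(x))
  p_out <- (out_class + smooth) / (rowSums(out_class) + smooth * ncol(x))
  log(p_in) - log(p_out)
}

llr <- llr_sketch(x, y)
round(llr, 2)
```

In this toy example, "tax" receives a positive score for "economy" and a negative one for "security", while "war" shows the opposite pattern. A larger `smooth` pulls all scores toward zero.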

Value

Returns a fitted textmodel_wordmap object with the following elements:

`model`	a matrix that records the association between classes and features.
`data`	the original input of `x`.
`feature`	the feature set in the model.
`concatenator`	the concatenator in `x`.
`entropy`	the type of entropy weights used.
`boolean`	the use of the Boolean transformation of `x`.
`call`	the command used to execute the function.
`version`	the version of the wordmap package.
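
The elements listed above can be inspected with `$` on a fitted object. The sketch below assumes the wordmap package supplies `data_corpus_ungd2017` and `data_dictionary_topic`. It makes no claim about the orientation or storage class of `model`, only that the element exists as documented.

```r
library(quanteda)
library(wordmap)  # assumed to provide the example corpus and dictionary

toks <- tokens(corpus_reshape(data_corpus_ungd2017), remove_punct = TRUE)
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(tokens_lookup(toks, data_dictionary_topic))
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)

dim(map$model)     # size of the class-feature association matrix
head(map$feature)  # first few features in the model
map$entropy        # type of entropy weights used
map$boolean        # whether the Boolean transformation was applied
map$version        # version of the wordmap package
```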

References

Watanabe, Kohei (2018). "Newsmap: semi-supervised approach to geographical news classification". doi:10.1080/21670811.2017.1293487. Digital Journalism.

Watanabe, Kohei & Zhou, Yuan (2020). "Theory-Driven Analysis of Large Corpora: Semisupervised Topic Classification of the UN Speeches". doi:10.1177/0894439320907027. Social Science Computer Review.

Examples

library(quanteda)
library(wordmap)

# split into sentences
corp <- corpus_reshape(data_corpus_ungd2017)

# tokenize
toks <- tokens(corp, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("en"))

# apply seed dictionary
toks_dict <- tokens_lookup(toks, data_dictionary_topic)

# form dfm
dfmt_feat <- dfm(toks)
dfmt_dict <- dfm(toks_dict)

# fit wordmap model
map <- textmodel_wordmap(dfmt_feat, dfmt_dict)
coef(map)
predict(map)

[Package wordmap version 0.8.0 Index]