compare_subset {corpustools}R Documentation

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

Description

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

Usage

compare_subset(
  tc,
  feature,
  subset_x = NULL,
  subset_meta_x = NULL,
  query_x = NULL,
  query_feature = "token",
  smooth = 0.1,
  min_ratio = NULL,
  min_chi2 = NULL,
  yates_cor = c("auto", "yes", "no"),
  what = c("freq", "docfreq", "cooccurrence")
)

Arguments

tc

a tCorpus

feature

the column name of the feature that is to be compared

subset_x

an expression to subset the tCorpus. The vocabulary of the subset will be compared to the rest of the tCorpus

subset_meta_x

like subset_x, but using using the meta data

query_x

like subset_x, but using a query search to select documents (see search_contexts)

query_feature

if query_x is used, the column name of the feature used in the query search.

smooth

Laplace smoothing is used for the calculation of the probabilities. Here you can set the added (pseuocount) value.

min_ratio

threshold for the ratio value, which is the ratio of the relative frequency of a term in dtm.x and dtm.y

min_chi2

threshold for the chi^2 value

yates_cor

mode for using yates correctsion in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case cochrans rule is used to determine whether yates' correction is used.

what

choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N)

Value

A vocabularyComparison object

Examples


tc = create_tcorpus(sotu_texts, doc_column = 'id')

tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)

comp = compare_subset(tc, 'feature', subset_meta_x = president == 'Barack Obama')
comp = comp[order(-comp$chi),]
head(comp)
plot(comp)

comp = compare_subset(tc, 'feature', query_x = 'terroris*')
comp = comp[order(-comp$chi),]
head(comp, 10)


[Package corpustools version 0.5.1 Index]