compare_subset {corpustools}

R Documentation

Compare vocabulary of a subset of a tCorpus to the rest of the tCorpus

Description

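Compares the vocabulary of a subset of a tCorpus to the vocabulary of the rest of the tCorpus. The subset can be selected with an expression on the token data (subset_x), an expression on the meta data (subset_meta_x), or a query search (query_x).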

Usage

compare_subset(
  tc,
  feature,
  subset_x = NULL,
  subset_meta_x = NULL,
  query_x = NULL,
  query_feature = "token",
  smooth = 0.1,
  min_ratio = NULL,
  min_chi2 = NULL,
  yates_cor = c("auto", "yes", "no"),
  what = c("freq", "docfreq", "cooccurrence")
)

Arguments

`tc`	a `tCorpus`
`feature`	the column name of the feature that is to be compared
`subset_x`	an expression to subset the tCorpus. The vocabulary of the subset will be compared to the rest of the tCorpus
`subset_meta_x`	like subset_x, but using the meta data
`query_x`	like subset_x, but using a query search to select documents (see search_contexts)
`query_feature`	if query_x is used, the column name of the feature used in the query search.
`smooth`	Laplace smoothing is used for the calculation of the probabilities. Here you can set the added (pseudocount) value.
`min_ratio`	threshold for the ratio value, which is the ratio of the relative frequency of a term in the subset (x) to its relative frequency in the rest of the tCorpus (y)
`min_chi2`	threshold for the chi^2 value
`yates_cor`	mode for using Yates' correction in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case Cochran's rule is used to determine whether Yates' correction is applied.
`what`	choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N)
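
Details

The smooth, min_ratio, min_chi2 and yates_cor arguments are easier to follow with a small worked calculation. The following is a minimal base-R sketch, not the package's own implementation: the toy counts, the helper chi2_2x2, and the choice to build the 2x2 table from "this term" versus "all other terms" are assumptions made for illustration only.

```r
# Hypothetical toy term counts (not taken from any real corpus)
freq_x <- c(economy = 30, war = 5,  health = 0)   # counts in the subset (x)
freq_y <- c(economy = 20, war = 40, health = 10)  # counts in the rest (y)
smooth <- 0.1                                     # pseudocount, as in the default

# Laplace-smoothed relative frequencies: the pseudocount is added to every
# term count, so a term that is absent on one side still gets a finite ratio
p_x <- (freq_x + smooth) / sum(freq_x + smooth)
p_y <- (freq_y + smooth) / sum(freq_y + smooth)
ratio <- p_x / p_y

# Chi^2 for a 2x2 table (assumed layout):
#   a = term in x, b = term in y, c = other terms in x, d = other terms in y
chi2_2x2 <- function(a, b, c, d, yates = FALSE) {
  n <- a + b + c + d
  diff <- abs(a * d - b * c)
  # Yates' continuity correction shrinks the difference by n / 2 (floored at 0)
  if (yates) diff <- pmax(0, diff - n / 2)
  n * diff^2 / ((a + b) * (c + d) * (a + c) * (b + d))
}

chi2 <- chi2_2x2(freq_x, freq_y,
                 sum(freq_x) - freq_x,
                 sum(freq_y) - freq_y)

res <- data.frame(feature = names(freq_x), ratio = ratio, chi2 = chi2)
print(res)
```

Under these assumptions, a term with ratio above 1 is relatively more frequent in the subset, and min_ratio and min_chi2 would correspond to keeping only rows whose ratio or chi2 reaches the given threshold. The "auto" mode of yates_cor is not reproduced in this sketch.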

Value

A vocabularyComparison object

Examples


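## create a tCorpus from the State of the Union demo data
tc = create_tcorpus(sotu_texts, doc_column = 'id')

## add a preprocessed 'feature' column (stopwords removed, stemmed)
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)

## compare documents selected on meta data to the rest of the corpus
comp = compare_subset(tc, 'feature', subset_meta_x = president == 'Barack Obama')
comp = comp[order(-comp$chi2),]   ## sort by chi^2, highest first
head(comp)
plot(comp)

## compare documents matching a query to the rest of the corpus
comp = compare_subset(tc, 'feature', query_x = 'terroris*')
comp = comp[order(-comp$chi2),]
head(comp, 10)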

[Package corpustools version 0.5.1 Index]