compare_corpus {corpustools} R Documentation

## Compare tCorpus vocabulary to that of another (reference) tCorpus

### Description

Compare tCorpus vocabulary to that of another (reference) tCorpus

### Usage

compare_corpus(
  tc,
  tc_y,
  feature,
  smooth = 0.1,
  min_ratio = NULL,
  min_chi2 = NULL,
  is_subset = F,
  yates_cor = c("auto", "yes", "no"),
  what = c("freq", "docfreq", "cooccurrence")
)


### Arguments

| Argument | Description |
|---|---|
| tc | A tCorpus. |
| tc_y | The reference tCorpus. |
| feature | The column name of the feature that is to be compared. |
| smooth | Laplace smoothing is used for the calculation of the probabilities. Here you can set the added (pseudocount) value. |
| min_ratio | Threshold for the ratio value, which is the ratio of the relative frequency of a term in dtm.x (from tc) and dtm.y (from tc_y). |
| min_chi2 | Threshold for the chi^2 value. |
| is_subset | Specify whether tc is a subset of tc_y. In this case, the term frequencies of tc will be subtracted from the term frequencies in tc_y. |
| yates_cor | Mode for using Yates' correction in the chi^2 calculation. Can be turned on ("yes") or off ("no"), or set to "auto", in which case Cochran's rule is used to determine whether Yates' correction is used. |
| what | Choose whether to compare the frequency ("freq") of terms, or the document frequency ("docfreq"). This also affects how chi^2 is calculated, comparing either freq relative to vocabulary size or docfreq relative to corpus size (N). |
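
To make the roles of smooth and yates_cor more concrete, the following is a minimal sketch of how a smoothed ratio and a chi^2 value could be computed for a single term. It is not the corpustools implementation: the helper name compare_term, its arguments, the way the pseudocount enters the denominator, and the reading of Cochran's rule as "apply Yates' correction if any expected cell count is below 5" are all assumptions made for illustration.

```r
# Hypothetical helper (not part of corpustools): smoothed ratio and chi^2 for
# one term, given its frequency and the total token count in each corpus.
compare_term <- function(freq_x, freq_y, total_x, total_y,
                         smooth = 0.1, yates_cor = c("auto", "yes", "no")) {
  yates_cor <- match.arg(yates_cor)

  # Smoothed relative frequencies. Assumption: the pseudocount is simply
  # added to both the term count and the corpus total.
  p_x <- (freq_x + smooth) / (total_x + smooth)
  p_y <- (freq_y + smooth) / (total_y + smooth)
  ratio <- p_x / p_y

  # 2x2 table: this term vs. all other tokens, in corpus x vs. corpus y.
  other_x <- total_x - freq_x
  other_y <- total_y - freq_y
  n <- total_x + total_y
  term_total <- freq_x + freq_y
  other_total <- other_x + other_y

  # Expected cell counts under independence.
  expected <- c(term_total * total_x, term_total * total_y,
                other_total * total_x, other_total * total_y) / n

  # Assumed reading of Cochran's rule for a 2x2 table: correct when any
  # expected count is below 5.
  use_yates <- switch(yates_cor,
                      yes = TRUE,
                      no = FALSE,
                      auto = any(expected < 5))

  diff <- freq_x * other_y - freq_y * other_x
  if (use_yates) {
    # Yates' continuity correction, clamped at zero so that the correction
    # cannot increase the statistic.
    diff <- max(abs(diff) - n / 2, 0)
  }
  chi2 <- n * diff^2 / (term_total * other_total * total_x * total_y)

  list(ratio = ratio, chi2 = chi2, yates_used = use_yates)
}

# Usage: a term occurring 30 times in 1000 tokens vs. 10 times in 1000 tokens.
res <- compare_term(30, 10, 1000, 1000, yates_cor = "no")
print(res$ratio)
print(res$chi2)
```

Under this sketch, min_ratio and min_chi2 would act as filters on the ratio and chi2 values computed per term.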

### Value

A vocabularyComparison object

### Examples


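tc = create_tcorpus(sotu_texts, doc_column = 'id')

# Add a preprocessed 'feature' column (stopwords removed, tokens stemmed)
tc$preprocess('token', 'feature', remove_stopwords = TRUE, use_stemming = TRUE)
# Create one subset per president
obama = tc$subset_meta(president == 'Barack Obama', copy=TRUE)
bush = tc$subset_meta(president == 'George W. Bush', copy=TRUE)
# Compare the full corpus to the Bush subset (reference), then sort by descending chi^2
comp = compare_corpus(tc, bush, 'feature')
comp = comp[order(-comp$chi),]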