document_term_frequencies_statistics {udpipe}R Documentation

Add Term Frequency, Inverse Document Frequency and Okapi BM25 statistics to the output of document_term_frequencies

Description

Term frequency Inverse Document Frequency (tfidf) is calculated as the multiplication of

The Okapi BM25 statistic is calculated as the multiplication of the inverse document frequency and the weighted term frequency as defined at https://en.wikipedia.org/wiki/Okapi_BM25.

Usage

document_term_frequencies_statistics(x, k = 1.2, b = 0.75)

Arguments

x

a data.table as returned by document_term_frequencies containing the columns doc_id, term and freq.

k

parameter k1 of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 1.2.

b

parameter b of the Okapi BM25 ranking function as defined at https://en.wikipedia.org/wiki/Okapi_BM25. Defaults to 0.5.

Value

a data.table with columns doc_id, term, freq and added to that the computed statistics tf, idf, tfidf, tf_bm25 and bm25.

Examples

data(brussels_reviews_anno)

x <- document_term_frequencies(brussels_reviews_anno[, c("doc_id", "token")])
x <- document_term_frequencies_statistics(x)
head(x)

[Package udpipe version 0.8.11 Index]