tf_idf {labourR} | R Documentation |
Term frequency–Inverse document frequency
Description
Measures the weighted amount of information about the specificity of terms in a corpus. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. It is a statistical measure that increases in proportion to the number of times a word appears in a document and is offset by the number of documents in the corpus that contain the word. Variations of tf–idf are often used to estimate a document's relevance to a free-text query.
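The weighting described above can be illustrated with a minimal base-R sketch on a toy corpus. It is not the labourR implementation: the documents are made up, and the formulas are the commonly cited textbook variants (double normalization with K = 0.5 for term frequency, and a smoothed inverse document frequency), which is only an assumption about what the "double_norm" and "idf_smooth" defaults refer to.

```r
# Toy corpus: three short documents (hypothetical data, for illustration only).
docs <- c(d1 = "data engineer data", d2 = "data analyst", d3 = "software engineer")

# Tokenise on single spaces and count raw term occurrences per document.
tokens <- strsplit(docs, " ", fixed = TRUE)
counts <- lapply(tokens, table)

# Document frequency: the number of documents that contain each term.
n_docs <- length(docs)
vocab <- sort(unique(unlist(tokens)))
doc_freq <- vapply(
  vocab,
  function(term) sum(vapply(tokens, function(x) term %in% x, logical(1))),
  numeric(1)
)

# Smoothed inverse document frequency (assumed textbook variant).
idf <- log(n_docs / (1 + doc_freq)) + 1

# Double-normalised term frequency (K = 0.5) multiplied by idf, per document.
tfidf <- lapply(counts, function(f) {
  tf <- 0.5 + 0.5 * as.numeric(f) / max(f)
  setNames(tf * idf[names(f)], names(f))
})

print(tfidf)
```

In this sketch "data" occurs in two of the three documents, so in d2 it receives a lower weight than "analyst", which occurs in only one document.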
Usage
tf_idf(
  corpus,
  stopwords = NULL,
  id_col = "id",
  text_col = "text",
  tf_weight = "double_norm",
  idf_weight = "idf_smooth",
  min_chars = 2,
  norm = TRUE
)
Arguments
corpus |
Input data, with an id column and a text column. Can be of type data.frame or data.table. |
stopwords |
A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics. |
id_col |
Input data column name with the ids of the documents. |
text_col |
Input data column name with the documents. |
tf_weight |
Weighting scheme of term frequency. Choices are |
idf_weight |
Weighting scheme of inverse document frequency. Choices are |
min_chars |
Words with fewer characters than |
norm |
Logical value indicating whether to apply document normalization. |
Value
A data.table with three columns: class (derived from the given document ids), term, and tfIdf.
Examples
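library(data.table)
library(labourR)  # provides occupations_bundle, cleansing_corpus() and tf_idf()

# Work on a copy so the packaged data set is not modified by reference
corpus <- copy(occupations_bundle)
# Build one text field per occupation from its preferred and alternative labels
invisible(corpus[, text := paste(preferredLabel, altLabels)])
# Normalise the text with the package's cleansing helper
invisible(corpus[, text := cleansing_corpus(text)])
# Keep only the id and text columns, renamed to the defaults expected by tf_idf()
corpus <- corpus[, .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)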