R: Term frequency–Inverse document frequency

tf_idf {labourR}

R Documentation

Term frequency–Inverse document frequency

Description

Measures the weighted amount of information concerning the specificity of terms in a corpus. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. The tf–idf weight of a word increases in proportion to the number of times the word appears in a document and is offset by the number of documents in the corpus that contain the word. Variations of tf–idf are often used to estimate a document's relevance given a free-text query.
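
The offsetting effect described above can be illustrated with a minimal base R sketch. It assumes the plain textbook definitions (raw counts for term frequency and log(N / df) for inverse document frequency) and a made-up three-document corpus; it is not the package's implementation, and all object names are illustrative.

```r
# Toy corpus: three short documents (illustrative data, not from the package).
docs <- c(
  d1 = "data analyst data engineer",
  d2 = "software engineer",
  d3 = "data scientist"
)

# Tokenise on single spaces; names d1..d3 are preserved.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary and raw term counts per document (terms x documents matrix).
vocab <- sort(unique(unlist(tokens)))
tf <- vapply(
  tokens,
  function(tok) tabulate(match(tok, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(tf) <- vocab

# Document frequency: number of documents that contain each term.
df <- rowSums(tf > 0)

# Plain inverse document frequency (assumed textbook form).
idf <- log(length(docs) / df)

# tf-idf: each row (term) of the count matrix is scaled by that term's idf.
tfidf <- tf * idf
round(tfidf, 3)
```

In document `d1`, "data" occurs twice but appears in two of the three documents, so it ends up with a lower weight than "analyst", which occurs once but only in `d1`.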

Usage

tf_idf(
  corpus,
  stopwords = NULL,
  id_col = "id",
  text_col = "text",
  tf_weight = "double_norm",
  idf_weight = "idf_smooth",
  min_chars = 2,
  norm = TRUE
)

Arguments

`corpus`	Input data, with an id column and a text column. Can be of type data.frame or data.table.
`stopwords`	A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics.
`id_col`	Input data column name with the ids of the documents.
`text_col`	Input data column name with the documents.
`tf_weight`	Weighting scheme of term frequency. Choices are `raw_count`, `double_norm` or `log_norm` for raw count, double normalization at 0.5 and log normalization respectively.
`idf_weight`	Weighting scheme of inverse document frequency. Choices are `idf` and `idf_smooth` for inverse document frequency and inverse document frequency smooth respectively.
`min_chars`	Words with fewer characters than `min_chars` are filtered out before calculating numerical statistics.
`norm`	Logical value indicating whether document normalization is applied.
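
For orientation, the weighting choices named in `tf_weight` and `idf_weight` can be sketched with their common textbook forms. The function names below are hypothetical helpers written for this illustration, and the exact formulas used inside the package may differ.

```r
# Term-frequency weights for one document, given raw counts f of its terms.
# These are the usual textbook forms, assumed here for illustration.
tf_raw_count   <- function(f) f
tf_double_norm <- function(f, k = 0.5) k + (1 - k) * f / max(f)
tf_log_norm    <- function(f) log(1 + f)

# Inverse-document-frequency weights, given the document frequency n_t
# of each term and the number of documents N.
idf_plain  <- function(n_t, N) log(N / n_t)
idf_smooth <- function(n_t, N) log(N / (1 + n_t)) + 1

# Raw counts of three terms in a single hypothetical document.
f <- c(data = 2, analyst = 1, engineer = 1)

tf_double_norm(f)        # the most frequent term gets 1, the others 0.75
idf_plain(n_t = 3, N = 3)   # a term found in every document gets 0
idf_smooth(n_t = 3, N = 3)  # the smoothed variant gives a different value
```

Double normalization keeps every term weight between 0.5 and 1 within a document, which reduces the bias towards long documents that raw counts introduce.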

Value

A data.table with three columns: `class` (derived from the given document ids), `term` and `tfIdf`.
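
A result of this shape can be queried with ordinary data.table syntax. The sketch below builds a hypothetical table with the three documented columns (the values are made up for illustration, not produced by `tf_idf`) and keeps the two highest-weighted terms per class.

```r
library(data.table)

# Hypothetical result with the three documented columns; values are made up.
res <- data.table(
  class = c("a", "a", "a", "b", "b"),
  term  = c("data", "analyst", "engineer", "software", "engineer"),
  tfIdf = c(0.41, 0.55, 0.20, 0.62, 0.20)
)

# Sort by descending weight, then keep the first two rows within each class.
top_terms <- res[order(-tfIdf), head(.SD, 2), by = class]
top_terms
```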

Examples

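library(data.table)
library(labourR)
# Build one text field per record from its preferred and alternative labels.
corpus <- copy(occupations_bundle)
invisible(corpus[, text := paste(preferredLabel, altLabels)])
invisible(corpus[, text := cleansing_corpus(text)])
# Keep only the id and text columns, renamed to the defaults used by tf_idf().
corpus <- corpus[, .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)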

[Package labourR version 1.0.0 Index]