tf_idf {labourR}    R Documentation

Term frequency–Inverse document frequency

Description

Measures how specific each term is to the documents of a corpus by weighting term counts. Term frequency–inverse document frequency (tf–idf) is one of the most frequently applied weighting schemes in information retrieval systems. The tf–idf weight of a word increases with the number of times the word appears in a document and is offset by the number of documents in the corpus that contain the word. Variations of tf–idf are often used to estimate a document's relevance to a free-text query.
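The trade-off described above can be sketched in a few lines of base R. This is a minimal illustration on a made-up three-document corpus, using raw counts and the plain log(N / n_t) inverse document frequency; it is not the package's implementation, and the names docs and term_counts are hypothetical.

```r
# Toy corpus: three short documents (hypothetical data for illustration).
docs <- c(d1 = "nurse hospital nurse", d2 = "teacher school", d3 = "nurse school")

# Split each document into terms on single spaces.
tokens <- strsplit(docs, " ", fixed = TRUE)

# Helper (hypothetical): named numeric vector of counts per distinct term.
term_counts <- function(x) {
  tab <- table(x)
  setNames(as.numeric(tab), names(tab))
}

# Raw term counts within each document.
tf <- lapply(tokens, term_counts)

# Number of documents that contain each term.
df <- term_counts(unlist(lapply(tokens, unique)))

# Plain inverse document frequency: log(N / n_t).
idf <- log(length(docs) / df)

# tf-idf per document: term count times the idf of that term.
tfidf <- lapply(tf, function(counts) counts * idf[names(counts)])
tfidf$d1
```

In document d1, "hospital" occurs once and "nurse" twice, yet "hospital" receives the larger weight because it appears in only one of the three documents while "nurse" appears in two.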

Usage

tf_idf(
  corpus,
  stopwords = NULL,
  id_col = "id",
  text_col = "text",
  tf_weight = "double_norm",
  idf_weight = "idf_smooth",
  min_chars = 2,
  norm = TRUE
)

Arguments

corpus

Input data, with an id column and a text column. Can be of type data.frame or data.table.

stopwords

A character vector of stopwords. Stopwords are filtered out before calculating numerical statistics.

id_col

Input data column name with the ids of the documents.

text_col

Input data column name with the documents.

tf_weight

Weighting scheme of term frequency. Choices are raw_count, double_norm or log_norm for raw count, double normalization at 0.5 and log normalization respectively.

idf_weight

Weighting scheme of inverse document frequency. Choices are idf and idf_smooth for inverse document frequency and inverse document frequency smooth respectively.

min_chars

Words with fewer characters than min_chars are filtered out before calculating numerical statistics.

norm

Logical value. If TRUE (the default), document normalization is applied.
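The arguments above can be read as a small pipeline: filter tokens, weight term counts, weight document frequencies, then optionally normalize. The base R sketch below uses the textbook forms of each weighting scheme, which are assumed rather than confirmed to match what tf_idf computes internally. The normalization step (scaling each document to unit Euclidean length) is likewise an assumption, since this page does not say which normalization is used. All object names (docs, tf_weights, idf_weights, count_terms, scores) are hypothetical.

```r
# Hypothetical tokenized documents (not taken from occupations_bundle).
docs <- list(
  d1 = c("a", "registered", "nurse", "of", "the", "icu", "nurse"),
  d2 = c("the", "school", "teacher"),
  d3 = c("school", "nurse")
)
stopwords <- c("of", "the")
min_chars <- 2

# 1. Drop stopwords and words with fewer characters than min_chars.
docs <- lapply(docs, function(w) w[!(w %in% stopwords) & nchar(w) >= min_chars])

# 2. Term-frequency weights applied to the raw counts f of one document
#    (textbook definitions; assumed, not taken from the package source).
tf_weights <- list(
  raw_count   = function(f) f,
  double_norm = function(f) 0.5 + 0.5 * f / max(f),
  log_norm    = function(f) log(1 + f)
)

# 3. Inverse-document-frequency weights; n is the corpus size and n_t the
#    number of documents containing the term (textbook definitions; assumed).
idf_weights <- list(
  idf        = function(n, n_t) log(n / n_t),
  idf_smooth = function(n, n_t) log(n / (1 + n_t)) + 1
)

# Helper: named numeric vector of counts per distinct word.
count_terms <- function(w) {
  tab <- table(w)
  setNames(as.numeric(tab), names(tab))
}

n_t <- count_terms(unlist(lapply(docs, unique)))
idf <- idf_weights$idf_smooth(length(docs), n_t)

scores <- lapply(docs, function(w) {
  f <- count_terms(w)
  s <- tf_weights$double_norm(f) * idf[names(f)]
  # 4. Assumed normalization: scale each document to unit Euclidean length.
  s / sqrt(sum(s^2))
})
scores$d1
```

Keeping the schemes in named lists mirrors the string choices accepted by tf_weight and idf_weight, so switching schemes in the sketch is a matter of picking a different list element.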

Value

A data.table with three columns: class (derived from the given document ids), term and tfIdf.
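To make the long, one-row-per-document-term layout concrete, the base R sketch below reshapes a list of per-document weights into a table with the same three column names. The weights are invented placeholder values rather than output of tf_idf, and a plain data.frame stands in for the data.table that the function returns.

```r
# Placeholder per-document weights (invented values, for layout only).
weights <- list(
  d1 = c(hospital = 1.10, nurse = 0.81),
  d2 = c(school = 0.41, teacher = 1.10)
)

# One row per (document, term) pair, using the column names listed above.
long <- data.frame(
  class = rep(names(weights), lengths(weights)),
  term  = unlist(lapply(weights, names), use.names = FALSE),
  tfIdf = unlist(weights, use.names = FALSE)
)

# Highest-weighted terms first within each document.
long[order(long$class, -long$tfIdf), ]
```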

Examples

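library(labourR)
library(data.table)

# Work on a copy so that occupations_bundle is not modified by reference.
corpus <- copy(occupations_bundle)
# Combine preferred and alternative labels into one text column, then clean it.
invisible(corpus[, text := paste(preferredLabel, altLabels)])
invisible(corpus[, text := cleansing_corpus(text)])
# Keep the concept URI as document id and rename to the default id_col and text_col.
corpus <- corpus[, .(conceptUri, text)]
setnames(corpus, c("id", "text"))
tf_idf(corpus)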


[Package labourR version 1.0.0 Index]