TfIdf {text2vec}    R Documentation
TfIdf
Description
Creates a TfIdf (term frequency - inverse document frequency) model.
"smooth" IDF (default) is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears) )
"non-smooth" IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears) )
Usage
TfIdf
Format
R6Class object.
Details
Term Frequency - Inverse Document Frequency (TF-IDF)
Usage
For usage details see Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
Methods
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates a tf-idf model.
$fit_transform(x)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the tf-idf weights learned from the training data; see the sketch below.
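A minimal sketch of the three methods together, assuming the movie_review data shipped with text2vec; the train/new split and object names are illustrative, not part of the package documentation:

library(text2vec)
data("movie_review")
train_tokens = word_tokenizer(tolower(movie_review$review[1:80]))
new_tokens   = word_tokenizer(tolower(movie_review$review[81:100]))
vectorizer = hash_vectorizer()
dtm_train = create_dtm(itoken(train_tokens), vectorizer)
dtm_new   = create_dtm(itoken(new_tokens), vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = tfidf$fit_transform(dtm_train)  # learns IDF weights from the training matrix
dtm_new_tfidf   = tfidf$transform(dtm_new)        # reuses those IDF weights on new data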
Arguments
tfidf
A TfIdf object.

x
An input document-term matrix (sparse), preferably in dgCMatrix format.

smooth_idf
TRUE by default. Smooth IDF weights by adding one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen.

norm
c("l1", "l2", "none")
Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.

sublinear_tf
FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency TF with 1 + log(TF).
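For example (our sketch, not taken from the package documentation), the defaults can be overridden when constructing the model:

# non-smooth IDF, L2-normalized term vectors, sublinear TF scaling
tfidf_custom = TfIdf$new(smooth_idf = FALSE, norm = "l2", sublinear_tf = TRUE)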
Examples
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)