TfIdf {text2vec}    R Documentation
TfIdf
Description
Creates a TfIdf (term frequency - inverse document frequency) model.
"smooth" IDF (default) is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears) )
"non-smooth" IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears) )
Usage
TfIdf
Format
R6Class object.
Details
Term Frequency - Inverse Document Frequency (TF-IDF)
Usage
For usage details see Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
Methods
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates a tf-idf model.
$fit_transform(x)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the tf-idf weights learned from the training data; see the sketch below.
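A minimal sketch of the three methods together, assuming the movie_review data shipped with text2vec; the train/new split and object names are illustrative, not part of the package documentation:

library(text2vec)
data("movie_review")
train_tokens = word_tokenizer(tolower(movie_review$review[1:80]))
new_tokens   = word_tokenizer(tolower(movie_review$review[81:100]))
vectorizer = hash_vectorizer()
dtm_train = create_dtm(itoken(train_tokens), vectorizer)
dtm_new   = create_dtm(itoken(new_tokens), vectorizer)
tfidf = TfIdf$new()
dtm_train_tfidf = tfidf$fit_transform(dtm_train)  # learns IDF weights from the training matrix
dtm_new_tfidf   = tfidf$transform(dtm_new)        # reuses those IDF weights on new data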
Arguments
tfidf
A TfIdf object.

x
An input document-term matrix (sparse), preferably in dgCMatrix format.

smooth_idf
TRUE by default. Smooth IDF weights by adding one to document frequencies, as if an extra document containing every term in the collection exactly once had been seen.

norm
c("l1", "l2", "none")
Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.

sublinear_tf
FALSE by default. Apply sublinear term-frequency scaling, i.e., replace the term frequency TF with 1 + log(TF).
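For example (our sketch, not taken from the package documentation), the defaults can be overridden when constructing the model:

# non-smooth IDF, L2-normalized term vectors, sublinear TF scaling
tfidf_custom = TfIdf$new(smooth_idf = FALSE, norm = "l2", sublinear_tf = TRUE)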
Examples
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)