| TfIdf {text2vec} | R Documentation |
TfIdf
Description
Creates a TfIdf (term frequency inverse document frequency) model.
"smooth" IDF (default) is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears) )
"non-smooth" IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears) )
Usage
TfIdf
Format
R6Class object.
Details
Term Frequency Inverse Document Frequency
Usage
For usage details see the Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c('l1', 'l2', 'none'), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
Methods
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE): creates a tf-idf model.
$fit_transform(x): fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x): transforms new data x using the tf-idf weights learned from the training data (a sketch of this pattern follows).
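A minimal sketch of the fit/transform pattern, assuming dtm_train and dtm_test are document-term matrices built with the same vectorizer (the object names are illustrative):

tfidf = TfIdf$new()
dtm_train_tfidf = tfidf$fit_transform(dtm_train)  # learns IDF weights from the training matrix
dtm_test_tfidf  = tfidf$transform(dtm_test)       # reuses those weights on new data

Keeping the IDF weights fixed between train and test data ensures that both matrices are weighted on the same scale.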
Arguments
- tfidf: a TfIdf object.
- x: an input document-term matrix, preferably in dgCMatrix format.
- smooth_idf: TRUE by default; use the smoothed IDF defined in the Description, so that terms appearing in every document still receive a small positive weight (the non-smoothed variant assigns them zero).
- norm: one of c("l1", "l2", "none"); the type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.
- sublinear_tf: FALSE by default; apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF). (A configuration sketch follows this list.)
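As referenced above, a sketch of a non-default configuration built from the documented arguments (the object name is illustrative):

tfidf_l2 = TfIdf$new(smooth_idf = TRUE, norm = "l2", sublinear_tf = TRUE)

With sublinear_tf = TRUE, each non-zero term frequency TF is replaced by 1 + log(TF) before IDF weighting and normalization are applied.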
Examples
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
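# A possible continuation (not part of the original example): transform further
# documents with the already fitted model; hash_vectorizer() with default settings
# maps tokens into the same hashed column space as above.
tokens_new = word_tokenizer(tolower(movie_review$review[(N + 1):(2 * N)]))
dtm_new = create_dtm(itoken(tokens_new), hash_vectorizer())
dtm_new_tfidf = model_tfidf$transform(dtm_new)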