TfIdfVectorizer {superml} | R Documentation |
TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizer
Description
Creates a tf-idf matrix
Details
Given a list of texts, creates a sparse matrix of tf-idf scores for the tokens in those texts.
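As a rough illustration of what a tf-idf score is, the following base R sketch builds a small dense tf-idf matrix by hand. It does not call superml: the whitespace tokenizer, the toy corpus, and the textbook idf formula log(n / df) are all assumptions made for the example, and the package's own tokenization, smoothing, and normalization may differ.

```r
# Toy corpus; tokenization is a plain whitespace split (an assumption,
# not superml's actual tokenizer).
docs   <- c("alone in dark", "alone in the lot", "mothers in the lot")
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# Term-frequency matrix: one row per document, one column per token.
count_row <- function(tk) as.numeric(table(factor(tk, levels = vocab)))
tf <- t(vapply(tokens, count_row, numeric(length(vocab))))
colnames(tf) <- vocab

# Document frequency (number of documents containing each token)
# and a textbook, unsmoothed idf.
df  <- colSums(tf > 0)
idf <- log(nrow(tf) / df)

# Scale each column of the term-frequency matrix by its idf weight.
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

A token that occurs in every document, such as "in" here, gets an idf of zero under this formula and so contributes nothing to any row.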
Super class
superml::CountVectorizer
-> TfIdfVectorizer
Public fields
sentences
a list containing sentences
max_df
When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
min_df
When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_features
use only the top features, sorted by count, in the bag-of-words matrix.
ngram_range
The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split
splitting criteria for strings, default: " "
lowercase
convert all characters to lowercase before tokenizing
regex
regular expression to use for text cleaning.
remove_stopwords
a list of stopwords to use; by default, the inbuilt list of standard stopwords is used
smooth_idf
logical; if TRUE, adds one to document frequencies to prevent division by zero, as if an extra document had been seen containing every term in the collection exactly once
norm
logical; if TRUE, each output row will have unit ‘l2’ norm, i.e. the sum of squares of its elements is 1. If FALSE, non-normalized vectors are returned. Default: TRUE
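To make the smooth_idf and norm fields more concrete, here is a minimal base R sketch applied to a hypothetical count matrix. The smoothed formula log((1 + n) / (1 + df)) + 1 is the one scikit-learn documents for its option of the same name; that superml matches it term for term is an assumption, and the matrix values are made up for illustration.

```r
# Hypothetical count matrix: 3 documents (rows) x 3 terms (columns).
counts <- matrix(c(1, 0, 2,
                   1, 1, 0,
                   1, 0, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(NULL, c("in", "dark", "lot")))
n_docs <- nrow(counts)
df     <- colSums(counts > 0)

# Smoothed idf: add one to the document count and to every document
# frequency, as if one extra document contained every term once.
# (Assumed formula; see the note above.)
idf_smooth <- log((1 + n_docs) / (1 + df)) + 1

# Weight the counts column by column.
weighted <- sweep(counts, 2, idf_smooth, `*`)

# l2 normalization: divide each row by its Euclidean length so that
# the sum of squares of every row is 1.
row_norms  <- sqrt(rowSums(weighted^2))
normalized <- weighted / row_norms
rowSums(normalized^2)
```

With smoothing, a term present in every document still receives a positive weight instead of zero, and the denominator can never be zero. The row division relies on every row having at least one non-zero entry; an all-zero row would need special handling.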
Methods
Public methods
Method new()
Usage
TfIdfVectorizer$new( min_df, max_df, max_features, ngram_range, regex, remove_stopwords, split, lowercase, smooth_idf, norm )
Arguments
min_df
numeric. When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_df
numeric. When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
max_features
integer. Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range
vector. The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex
character, regular expression to use for text cleaning.
remove_stopwords
list, a list of stopwords to use; by default, the inbuilt list of standard English stopwords is used
split
character, splitting criteria for strings, default: " "
lowercase
logical, convert all characters to lowercase before tokenizing, default: TRUE
smooth_idf
logical; if TRUE, adds one to document frequencies to prevent division by zero, as if an extra document had been seen containing every term in the collection exactly once
norm
logical; if TRUE, each output row will have unit ‘l2’ norm, i.e. the sum of squares of its elements is 1. If FALSE, non-normalized vectors are returned. Default: TRUE
parallel
logical, speeds up n-gram computation using n-1 cores, default: TRUE
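The ngram_range, min_df, and max_df arguments can be illustrated with a short base R sketch. The helper word_ngrams() is hypothetical and written only for this example, tokenization is a plain split on " ", and document frequency is taken as the proportion of documents containing a term; none of this is superml's internal code.

```r
# Hypothetical helper: build word n-grams for every n in ngram_range.
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    out <- c(out, vapply(starts,
                         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                         character(1)))
  }
  out
}

docs  <- c("alone in the dark", "alone in the lot", "many mothers in the lot")
grams <- lapply(strsplit(docs, " ", fixed = TRUE), word_ngrams,
                ngram_range = c(1, 2))   # unigrams and bigrams

# Document frequency as the proportion of documents containing each n-gram.
vocab   <- unique(unlist(grams))
df_prop <- vapply(vocab,
                  function(g) mean(vapply(grams, function(d) g %in% d, logical(1))),
                  numeric(1))

# Drop n-grams strictly below min_df or strictly above max_df.
min_df <- 0.5
max_df <- 0.9
kept   <- names(df_prop)[df_prop >= min_df & df_prop <= max_df]
kept
```

In this toy corpus the terms found in all three documents ("in", "the", "in the") are removed by max_df, and the terms found in only one document are removed by min_df.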
Details
Create a new 'TfIdfVectorizer' object.
Returns
A 'TfIdfVectorizer' object.
Examples
TfIdfVectorizer$new()
Method fit()
Usage
TfIdfVectorizer$fit(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the TfIdfVectorizer model on sentences
Returns
NULL
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)
Method fit_transform()
Usage
TfIdfVectorizer$fit_transform(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the TfIdfVectorizer model on sentences and returns a sparse matrix of tf-idf scores for the tokens
Returns
a sparse matrix containing tf-idf score for tokens in each given sentence
Examples
## Not run:
sents <- c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
## End(Not run)
Method transform()
Usage
TfIdfVectorizer$transform(sentences)
Arguments
sentences
a list of new text sentences
Details
Returns a matrix of tf-idf scores for the tokens in new sentences, using the previously fitted model
Returns
a sparse matrix containing tf-idf score for tokens in each given sentence
Examples
## Not run:
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
tf = TfIdfVectorizer$new(min_df=0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
## End(Not run)
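The split between fitting and transforming can be sketched in base R as follows. This is an illustration of the general pattern, not superml's implementation: the tokenize() and count_row() helpers are hypothetical, the smoothed idf formula is assumed, and the choice to silently drop tokens that were not seen during fitting is also an assumption about how such vectorizers usually behave.

```r
train <- c("alone in dark", "alone in the lot", "mothers in the lot")
new   <- c("dark at night", "mothers day")

# Hypothetical tokenizer: lowercase, then split on a single space.
tokenize <- function(x) strsplit(tolower(x), " ", fixed = TRUE)

# "Fit": learn the vocabulary and idf weights from the training documents only.
train_tokens <- tokenize(train)
vocab <- sort(unique(unlist(train_tokens)))
df    <- vapply(vocab,
                function(v) sum(vapply(train_tokens, function(d) v %in% d, logical(1))),
                numeric(1))
idf   <- log((1 + length(train)) / (1 + df)) + 1   # assumed smoothed formula

# "Transform": count only tokens already in the fitted vocabulary.
# Unseen tokens ("at", "night", "day") become NA in the factor and are
# dropped by table().
count_row  <- function(tk) as.numeric(table(factor(tk, levels = vocab)))
new_counts <- t(vapply(tokenize(new), count_row, numeric(length(vocab))))
colnames(new_counts) <- vocab

# Reuse the idf weights learned during fitting.
new_tfidf <- sweep(new_counts, 2, idf, `*`)
round(new_tfidf, 3)
```

The resulting matrix has one row per new sentence and exactly the columns learned during fitting, so it lines up with the matrix produced from the training sentences.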
Method clone()
The objects of this class are cloneable with this method.
Usage
TfIdfVectorizer$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `TfIdfVectorizer$new`
## ------------------------------------------------
TfIdfVectorizer$new()
## ------------------------------------------------
## Method `TfIdfVectorizer$fit`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf = TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.3)
tf$fit(sents)
## ------------------------------------------------
## Method `TfIdfVectorizer$fit_transform`
## ------------------------------------------------
## Not run:
sents <- c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
tf <- TfIdfVectorizer$new(smooth_idf = TRUE, min_df = 0.1)
tf_matrix <- tf$fit_transform(sents)
## End(Not run)
## ------------------------------------------------
## Method `TfIdfVectorizer$transform`
## ------------------------------------------------
## Not run:
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
tf = TfIdfVectorizer$new(min_df=0.1)
tf$fit(sents)
tf_matrix <- tf$transform(new_sents)
## End(Not run)