CountVectorizer {superml} | R Documentation
Count Vectorizer
Description
Creates a CountVectorizer model.
Details
Given a list of text sentences, it builds a bag-of-words model and returns a sparse matrix of token counts.
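To make the bag-of-words idea concrete, here is a minimal base-R sketch that builds a token count matrix by hand. It is an illustration under simple assumptions (lowercasing, splitting on a single space, a dense matrix instead of a sparse one) and is not the package's implementation; the names `tokens`, `vocab` and `counts` are hypothetical.

```r
# Minimal bag-of-words sketch in base R (illustrative; not superml's code).
sents <- c("i am alone in dark", "alone in the dark", "many mothers in the lot")

# Lowercase each sentence and split it on single spaces.
tokens <- strsplit(tolower(sents), " ", fixed = TRUE)

# The vocabulary is the sorted set of distinct tokens.
vocab <- sort(unique(unlist(tokens)))

# One row per sentence, one column per vocabulary term.
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
```

Each row of `counts` sums to the number of tokens in the corresponding sentence, which is a quick sanity check on the result.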
Public fields
sentences
a list containing sentences
max_df
When building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
min_df
When building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_features
Build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus.
ngram_range
The lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split
splitting criteria for strings, default: " "
lowercase
convert all characters to lowercase before tokenizing
regex
regular expression to use for text cleaning.
remove_stopwords
a list of stopwords to use. By default, the built-in list of standard stopwords is used.
model
internal attribute which stores the count model
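The min_df and max_df fields above filter the vocabulary by document frequency. The following base-R sketch shows one way such a filter can be computed, assuming document frequency means the fraction of sentences that contain a term. The threshold values and the names `doc_freq` and `kept` are hypothetical, and the package's internal logic may differ.

```r
# Sketch of min_df / max_df filtering (illustrative; not superml's code).
sents <- c("alone in the dark", "many mothers in the lot",
           "a lot in the dark", "mother mary")
tokens <- strsplit(sents, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document frequency: the fraction of sentences that contain each term.
doc_freq <- vapply(
  vocab,
  function(term) mean(vapply(tokens, function(tok) term %in% tok, logical(1))),
  numeric(1)
)

min_df <- 0.3  # hypothetical thresholds chosen for this sketch
max_df <- 0.7

# Drop terms strictly below min_df or strictly above max_df.
kept <- vocab[doc_freq >= min_df & doc_freq <= max_df]
```

Here "in" and "the" appear in three of four sentences (0.75) and are dropped by max_df, while terms that appear in only one sentence (0.25) are dropped by min_df, leaving "dark" and "lot".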
Methods
Public methods
Method new()
Usage
CountVectorizer$new( min_df, max_df, max_features, ngram_range, regex, remove_stopwords, split, lowercase )
Arguments
min_df
numeric, when building the vocabulary, ignore terms whose document frequency is strictly lower than the given threshold. The value lies between 0 and 1.
max_df
numeric, when building the vocabulary, ignore terms whose document frequency is strictly higher than the given threshold. The value lies between 0 and 1.
max_features
integer, build a vocabulary that considers only the top max_features terms, ordered by term frequency across the corpus.
ngram_range
vector, the lower and upper boundary of the range of n-values for the word n-grams or char n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex
character, regular expression to use for text cleaning.
remove_stopwords
list, a list of stopwords to use. By default, the built-in list of standard English stopwords is used.
split
character, splitting criteria for strings, default: " "
lowercase
logical, convert all characters to lowercase before tokenizing, default: TRUE
Details
Create a new 'CountVectorizer' object.
Returns
A 'CountVectorizer' object.
Examples
cv = CountVectorizer$new(min_df=0.1)
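To illustrate what an ngram_range such as c(1, 2) selects, here is a small base-R sketch that expands a token vector into word n-grams. The helper `word_ngrams` is hypothetical and joining tokens with a single space is an assumption; it only mirrors the behaviour described for the ngram_range argument and is not taken from the package.

```r
# Sketch of word n-gram extraction for a given ngram_range (illustrative only).
# Assumes ngram_range[1] <= ngram_range[2].
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip sizes longer than the sentence itself.
    if (n > length(tokens)) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("alone", "in", "the", "dark")
unigrams_and_bigrams <- word_ngrams(tokens, c(1, 2))
bigrams_only <- word_ngrams(tokens, c(2, 2))
# bigrams_only is c("alone in", "in the", "the dark")
```

With c(1, 2) the result holds the four unigrams followed by the three bigrams, so a wider range grows the vocabulary quickly on real text.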
Method fit()
Usage
CountVectorizer$fit(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the CountVectorizer model on the given sentences.
Returns
NULL
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
Method fit_transform()
Usage
CountVectorizer$fit_transform(sentences)
Arguments
sentences
a list of text sentences
Details
Fits the CountVectorizer model and returns a sparse matrix of token counts.
Returns
a sparse matrix containing the token counts for each given sentence
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
Method transform()
Usage
CountVectorizer$transform(sentences)
Arguments
sentences
a list of new text sentences
Details
Returns a matrix of token counts for the given sentences.
Returns
a sparse matrix containing the token counts for each given sentence
Examples
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
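The fit-then-transform pattern above counts new sentences against a vocabulary learned earlier. The base-R sketch below shows the usual bag-of-words convention for that step, in which tokens outside the vocabulary are ignored. Whether superml handles unseen tokens in exactly this way is an assumption; the `vocab` vector and the helper `count_against` are hypothetical.

```r
# Sketch: counting new sentences against a fixed vocabulary (illustrative only).
vocab <- c("alone", "dark", "lot", "mothers")  # assumed to be learned during fit
new_sents <- c("dark at night", "mothers day")

count_against <- function(sentences, vocab) {
  tokens <- strsplit(tolower(sentences), " ", fixed = TRUE)
  # Tokens that are not vocabulary levels become NA and are not counted.
  m <- t(vapply(
    tokens,
    function(tok) as.integer(table(factor(tok, levels = vocab))),
    integer(length(vocab))
  ))
  colnames(m) <- vocab
  m
}

new_counts <- count_against(new_sents, vocab)
```

The result keeps the same columns as the fitted vocabulary, so matrices produced from training and new data line up column by column.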
Method clone()
The objects of this class are cloneable with this method.
Usage
CountVectorizer$clone(deep = FALSE)
Arguments
deep
Whether to make a deep clone.
Examples
## ------------------------------------------------
## Method `CountVectorizer$new`
## ------------------------------------------------
cv = CountVectorizer$new(min_df=0.1)
## ------------------------------------------------
## Method `CountVectorizer$fit`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
## ------------------------------------------------
## Method `CountVectorizer$fit_transform`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)
## ------------------------------------------------
## Method `CountVectorizer$transform`
## ------------------------------------------------
sents = c('i am alone in dark.','mother_mary a lot',
'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)