| CountVectorizer {superml} | R Documentation |
Count Vectorizer
Description
Creates CountVectorizer Model.
Details
Given a list of text, it generates a bag of words model and returns a sparse matrix consisting of token counts.
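To make the bag-of-words idea concrete, the following is a minimal base-R sketch of how a token count matrix can be built from a small corpus. It does not use superml and is not its internal implementation; the corpus and the variable names (docs, vocab, counts) are illustrative assumptions, and a dense matrix is used for simplicity.

```r
# Toy corpus (hypothetical example data, not from the package).
docs <- c("alone in the dark", "many mothers in the lot")

# Tokenize on a single space, mirroring the documented default split of " ".
tokens <- strsplit(tolower(docs), " ", fixed = TRUE)

# Vocabulary: the unique tokens seen across the whole corpus.
vocab <- sort(unique(unlist(tokens)))

# Count matrix: one row per document, one column per vocabulary term.
counts <- t(vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = vocab))),
  integer(length(vocab))
))
colnames(counts) <- vocab
print(counts)
```

Each row of counts describes one document, and each entry records how often the column's term occurs in that document.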
Public fields
sentences: A list containing sentences.
max_df: When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.
min_df: When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.
max_features: Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
split: Splitting criterion for strings, default: " "
lowercase: Convert all characters to lowercase before tokenizing.
regex: Regular expression to use for text cleaning.
remove_stopwords: A list of stopwords to use; by default, the built-in list of standard stopwords is used.
model: Internal attribute that stores the count model.
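The min_df and max_df thresholds can be illustrated with a short base-R sketch. It assumes that document frequency means the fraction of documents containing a term, which matches the documented 0 to 1 range; the threshold values and variable names are chosen for illustration and this is not superml's own code.

```r
# Toy corpus (hypothetical example data).
docs <- c("i am alone in dark", "mother_mary a lot",
          "alone in the dark", "many mothers in the lot")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Document frequency: the fraction of documents that contain each term.
vocab <- sort(unique(unlist(tokens)))
doc_freq <- vapply(
  vocab,
  function(term) mean(vapply(tokens, function(tok) term %in% tok, logical(1))),
  numeric(1)
)

# Assumed thresholds for illustration: drop terms whose document frequency is
# strictly lower than min_df or strictly higher than max_df.
min_df <- 0.3
max_df <- 0.7
kept <- names(doc_freq)[doc_freq >= min_df & doc_freq <= max_df]
print(doc_freq)
print(kept)
```

With these thresholds, terms that appear in only one of the four documents fall below min_df, and "in", which appears in three of them, exceeds max_df.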
Methods
Public methods
Method new()
Usage
CountVectorizer$new( min_df, max_df, max_features, ngram_range, regex, remove_stopwords, split, lowercase )
Arguments
min_df: numeric. When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. The value lies between 0 and 1.
max_df: numeric. When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold. The value lies between 0 and 1.
max_features: integer. Build a vocabulary that only considers the top max_features terms, ordered by term frequency across the corpus.
ngram_range: vector. The lower and upper boundary of the range of n-values for the word or character n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used. For example, an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.
regex: character. Regular expression to use for text cleaning.
remove_stopwords: list. A list of stopwords to use; by default, the built-in list of standard English stopwords is used.
split: character. Splitting criterion for strings, default: " "
lowercase: logical. Convert all characters to lowercase before tokenizing, default: TRUE
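The effect of ngram_range can be sketched in base R as follows. The helper word_ngrams is hypothetical and not part of superml; joining the words of an n-gram with a single space is also an assumption made for this illustration.

```r
# `word_ngrams` is a hypothetical helper, not a superml function.
# It returns every word n-gram for n from ngram_range[1] to ngram_range[2].
word_ngrams <- function(tokens, ngram_range = c(1, 1)) {
  out <- character(0)
  for (n in seq(ngram_range[1], ngram_range[2])) {
    # Skip sizes that are longer than the token sequence itself.
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- strsplit("alone in the dark", " ", fixed = TRUE)[[1]]
print(word_ngrams(tokens, c(1, 2)))  # unigrams and bigrams
print(word_ngrams(tokens, c(2, 2)))  # bigrams only
```

A wider range grows the vocabulary quickly, which is one reason to combine it with max_features or the document-frequency thresholds.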
Details
Create a new 'CountVectorizer' object.
Returns
A 'CountVectorizer' object.
Examples
library(superml)
cv <- CountVectorizer$new(min_df = 0.1)
Method fit()
Usage
CountVectorizer$fit(sentences)
Arguments
sentences: A list of text sentences.
Details
Fits the CountVectorizer model on the given sentences.
Returns
NULL
Examples
library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)
Method fit_transform()
Usage
CountVectorizer$fit_transform(sentences)
Arguments
sentences: A list of text sentences.
Details
Fits the CountVectorizer model and returns a sparse matrix of token counts.
Returns
A sparse matrix containing the count of each token in each given sentence.
Examples
library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv_count_matrix <- cv$fit_transform(sents)
Method transform()
Usage
CountVectorizer$transform(sentences)
Arguments
sentences: A list of new text sentences.
Details
Returns a matrix of token counts for the given sentences.
Returns
A sparse matrix containing the count of each token in each given sentence.
Examples
library(superml)
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
new_sents <- c('dark at night', 'mothers day')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)
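The separation between fit() and transform() can be pictured with a base-R sketch: the vocabulary is learned once, and new sentences are then counted against that fixed vocabulary. The helpers fit_vocab and count_tokens are hypothetical, and dropping tokens that were not seen during fitting is an assumption that reflects how count vectorizers commonly behave, not a confirmed detail of superml.

```r
# `fit_vocab` and `count_tokens` are hypothetical helpers, not superml functions.
fit_vocab <- function(sentences, split = " ") {
  # Learn the vocabulary: unique lowercase tokens across the training sentences.
  sort(unique(unlist(strsplit(tolower(sentences), split, fixed = TRUE))))
}

count_tokens <- function(sentences, vocab, split = " ") {
  tokens <- strsplit(tolower(sentences), split, fixed = TRUE)
  counts <- matrix(0L, nrow = length(sentences), ncol = length(vocab),
                   dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    # Assumption: tokens outside the fitted vocabulary are ignored.
    known <- tokens[[i]][tokens[[i]] %in% vocab]
    if (length(known) == 0) next
    tab <- table(known)
    counts[i, names(tab)] <- as.integer(tab)
  }
  counts
}

vocab <- fit_vocab(c("alone in the dark", "many mothers in the lot"))
new_counts <- count_tokens(c("dark at night", "mothers day"), vocab)
print(new_counts)
```

In this sketch the columns of new_counts always match the fitted vocabulary, so words such as "night" and "day" contribute nothing to the result.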
Method clone()
The objects of this class are cloneable with this method.
Usage
CountVectorizer$clone(deep = FALSE)
Arguments
deep: Whether to make a deep clone.
Examples
library(superml)

## ------------------------------------------------
## Method `CountVectorizer$new`
## ------------------------------------------------
cv <- CountVectorizer$new(min_df = 0.1)

## ------------------------------------------------
## Method `CountVectorizer$fit`
## ------------------------------------------------
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)

## ------------------------------------------------
## Method `CountVectorizer$fit_transform`
## ------------------------------------------------
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df = 0.1)
cv_count_matrix <- cv$fit_transform(sents)

## ------------------------------------------------
## Method `CountVectorizer$transform`
## ------------------------------------------------
sents <- c('i am alone in dark.', 'mother_mary a lot',
           'alone in the dark?', 'many mothers in the lot....')
new_sents <- c('dark at night', 'mothers day')
cv <- CountVectorizer$new(min_df = 0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)