CountVectorizer {superml}R Documentation

Count Vectorizer

Description

Creates CountVectorizer Model.

Details

Given a list of text, it generates a bag of words model and returns a sparse matrix consisting of token counts.

Public fields

sentences

a list containing sentences

max_df

When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold, value lies between 0 and 1.

min_df

When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold, value lies between 0 and 1.

max_features

Build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

ngram_range

The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such such that min_n <= n <= max_n will be used. For example an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.

split

splitting criteria for strings, default: " "

lowercase

convert all characters to lowercase before tokenizing

regex

regex expression to use for text cleaning.

remove_stopwords

a list of stopwords to use, by default it uses its inbuilt list of standard stopwords

model

internal attribute which stores the count model

Methods

Public methods


Method new()

Usage
CountVectorizer$new(
  min_df,
  max_df,
  max_features,
  ngram_range,
  regex,
  remove_stopwords,
  split,
  lowercase
)
Arguments
min_df

numeric, When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold, value lies between 0 and 1.

max_df

numeric, When building the vocabulary ignore terms that have a document frequency strictly higher than the given threshold, value lies between 0 and 1.

max_features

integer, Build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

ngram_range

vector, The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted. All values of n such such that min_n <= n <= max_n will be used. For example an ngram_range of c(1, 1) means only unigrams, c(1, 2) means unigrams and bigrams, and c(2, 2) means only bigrams.

regex

character, regex expression to use for text cleaning.

remove_stopwords

list, a list of stopwords to use, by default it uses its inbuilt list of standard english stopwords

split

character, splitting criteria for strings, default: " "

lowercase

logical, convert all characters to lowercase before tokenizing, default: TRUE

Details

Create a new 'CountVectorizer' object.

Returns

A 'CountVectorizer' object.

Examples
cv = CountVectorizer$new(min_df=0.1)

Method fit()

Usage
CountVectorizer$fit(sentences)
Arguments
sentences

a list of text sentences

Details

Fits the countvectorizer model on sentences

Returns

NULL

Examples
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)

Method fit_transform()

Usage
CountVectorizer$fit_transform(sentences)
Arguments
sentences

a list of text sentences

Details

Fits the countvectorizer model and returns a sparse matrix of count of tokens

Returns

a sparse matrix containing count of tokens in each given sentence

Examples
sents = c('i am alone in dark.','mother_mary a lot',
         'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)

Method transform()

Usage
CountVectorizer$transform(sentences)
Arguments
sentences

a list of new text sentences

Details

Returns a matrix of count of tokens

Returns

a sparse matrix containing count of tokens in each given sentence

Examples
sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)

Method clone()

The objects of this class are cloneable with this method.

Usage
CountVectorizer$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples


## ------------------------------------------------
## Method `CountVectorizer$new`
## ------------------------------------------------

cv = CountVectorizer$new(min_df=0.1)

## ------------------------------------------------
## Method `CountVectorizer$fit`
## ------------------------------------------------

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)

## ------------------------------------------------
## Method `CountVectorizer$fit_transform`
## ------------------------------------------------

sents = c('i am alone in dark.','mother_mary a lot',
         'alone in the dark?', 'many mothers in the lot....')
cv <- CountVectorizer$new(min_df=0.1)
cv_count_matrix <- cv$fit_transform(sents)

## ------------------------------------------------
## Method `CountVectorizer$transform`
## ------------------------------------------------

sents = c('i am alone in dark.','mother_mary a lot',
          'alone in the dark?', 'many mothers in the lot....')
new_sents <- c("dark at night",'mothers day')
cv = CountVectorizer$new(min_df=0.1)
cv$fit(sents)
cv_count_matrix <- cv$transform(new_sents)

[Package superml version 0.5.7 Index]