R: Vocabulary and hash vectorizers

vectorizers {text2vec}

R Documentation

Vocabulary and hash vectorizers

Description

These functions create a vectorizer object (closure) that defines how to transform a list of tokens into a vector space, i.e. how to map words to indices. A vectorizer is intended to be used only as an argument to create_dtm, create_tcm, create_vocabulary.
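As a rough illustration of what "mapping words to indices" means, the following sketch uses only base R and does not call text2vec. The token vector and the variable names (`tokens`, `vocab`, `idx`, `dtm_row`) are made up for this example; it shows the idea behind a vocabulary-based mapping, not the package's internal implementation.

```r
# Toy illustration only: base R, no text2vec calls.
tokens <- c("the", "movie", "was", "good", "the", "end")

# A vocabulary assigns each distinct term a fixed column index.
vocab <- sort(unique(tokens))

# Mapping words to indices is then a lookup of each token in the vocabulary.
idx <- match(tokens, vocab)

# Counting how often each index occurs gives one row of a document-term matrix.
dtm_row <- tabulate(idx, nbins = length(vocab))
names(dtm_row) <- vocab
print(dtm_row)
```

Terms that are absent from the vocabulary would get `NA` from `match` and would be dropped by `tabulate`, which is one reason a hash-based mapping can be attractive when the vocabulary is not known in advance.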

Usage

vocab_vectorizer(vocabulary)

hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L),
  signed_hash = FALSE)

Arguments

`vocabulary`	`text2vec_vocabulary` object, see create_vocabulary.
`hash_size`	`integer`. The number of hash buckets for the feature hashing trick. The number must be greater than 0, and preferably a power of 2.
`ngram`	`integer` vector. The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of `n` such that ngram_min <= n <= ngram_max will be used.
`signed_hash`	`logical`, indicating whether to use a signed hash-function to reduce collisions when hashing.
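The interplay of `hash_size`, `ngram` and `signed_hash` can be sketched in base R. Everything below is an assumption made for illustration: `toy_hash` is a deliberately simple stand-in and is not the hash function text2vec uses, `make_ngrams` is a hypothetical helper, and the `"_"` separator for n-grams is chosen here only for readability. In the usual formulation of the hashing trick, a signed hash adds +1 or -1 per term so that colliding terms tend to cancel rather than accumulate.

```r
# Hypothetical toy hash: NOT the hash function used by text2vec.
toy_hash <- function(token, seed = 31) {
  h <- 0
  for (code in utf8ToInt(token)) {
    # The modulus keeps h small enough to stay exact in double arithmetic.
    h <- (h * seed + code) %% 2147483647
  }
  h
}

# Hypothetical helper: all n-grams with ngram_min <= n <= ngram_max.
make_ngrams <- function(tokens, ngram_min, ngram_max) {
  out <- character(0)
  for (n in ngram_min:ngram_max) {
    if (length(tokens) < n) next
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(starts, function(s) {
      paste(tokens[s:(s + n - 1)], collapse = "_")
    }, character(1))
    out <- c(out, grams)
  }
  out
}

hash_size <- 2^4  # tiny on purpose, so collisions are likely to show up
tokens <- c("the", "movie", "was", "good", "the")
terms <- make_ngrams(tokens, 1L, 2L)  # 5 unigrams followed by 4 bigrams

# Bucket index in 1..hash_size for every term.
bucket <- vapply(terms, function(tok) toy_hash(tok) %% hash_size, numeric(1)) + 1

# A second hash (different seed) decides the sign of each update.
sgn <- ifelse(
  vapply(terms, function(tok) toy_hash(tok, seed = 37) %% 2, numeric(1)) == 0,
  1, -1
)

unsigned_row <- numeric(hash_size)
signed_row <- numeric(hash_size)
for (i in seq_along(terms)) {
  unsigned_row[bucket[i]] <- unsigned_row[bucket[i]] + 1
  signed_row[bucket[i]] <- signed_row[bucket[i]] + sgn[i]
}
```

No vocabulary is stored in this scheme: the same term always lands in the same bucket, and the price is that distinct terms may share a bucket when `hash_size` is small.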

Value

A vectorizer object (closure).

Examples

library(text2vec)
data("movie_review")
N = 100

# Hash vectorizer: unigrams and bigrams hashed into 2^18 buckets
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)

# Vocabulary vectorizer: first build a unigram vocabulary
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L))

vectorizer = vocab_vectorizer(v)

# A fresh iterator over the same documents is created for the DTM pass
it = itoken(movie_review$review[1:N], preprocess_function = tolower,
             tokenizer = word_tokenizer, n_chunks = 10)

dtm = create_dtm(it, vectorizer)