R: A fast unigram vocabulary builder

vocab_builder {text2map}

R Documentation

A fast unigram vocabulary builder

Description

A streamlined function that takes raw texts from a column of a data.frame and produces a list of all the unique tokens. Texts are tokenized by splitting on a fixed, single whitespace character, and the unique tokens are then extracted. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Before building the vocabulary, texts should, if desired, have whitespace trimmed, punctuation removed, and terms lowercased.
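
The behavior described above can be approximated in base R. This is a minimal sketch, not the package's implementation: `sketch_vocab` is a hypothetical helper, it takes the column name as a string, and dropping empty tokens is an assumption made here that the package may not share.

```r
# Hypothetical helper, not part of text2map: mimics the described
# behavior (split on a single space, keep unique tokens) in base R.
sketch_vocab <- function(data, text) {
  # In this sketch, `text` is the column name given as a string
  docs <- as.character(data[[text]])
  # fixed = TRUE splits on a literal single space, with no regex matching
  tokens <- unlist(strsplit(docs, " ", fixed = TRUE))
  # Leading or repeated spaces yield empty strings; drop them here
  # (an assumption of this sketch)
  tokens <- tokens[nzchar(tokens)]
  unique(tokens)
}

docs <- data.frame(
  text = c("  The quick, brown fox. ", "the lazy dog"),
  stringsAsFactors = FALSE
)

# Preprocess before building the vocabulary:
# lowercase, remove punctuation, trim surrounding whitespace
docs$text <- tolower(docs$text)
docs$text <- gsub("[[:punct:]]", "", docs$text)
docs$text <- trimws(docs$text)

vocab <- sketch_vocab(docs, "text")
print(vocab)
```

Without the preprocessing step, "The" and "the" or "fox." and "fox" would count as distinct tokens, which is why cleaning the texts first matters when the vocabulary is meant to align columns across DTMs.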

Usage

vocab_builder(data, text)

Arguments

`data`	Data.frame with one column of texts
`text`	Name of the column with documents' text

Value

Returns a list of unique terms in a corpus.
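
Examples

A short usage sketch following the signature shown under Usage. The toy `corpus` data.frame is invented for illustration, and passing the column name unquoted is an assumption about how `text` is evaluated. The call is guarded so that it runs only when text2map is installed.

```r
# Toy corpus, already lowercased and free of punctuation
corpus <- data.frame(
  text = c("the quick brown fox", "the lazy dog"),
  stringsAsFactors = FALSE
)

# Run only if the package is available
if (requireNamespace("text2map", quietly = TRUE)) {
  # Assumption: the column name is supplied unquoted
  vocab <- text2map::vocab_builder(corpus, text)
  print(vocab)
}
```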

Author(s)

Dustin Stoltz

[Package text2map version 0.2.0 Index]