dtm_builder {text2map} | R Documentation |
A fast unigram DTM builder
Description
A streamlined function to take raw texts from a column of a data.frame and produce a sparse Document-Term Matrix (of generic class "dgCMatrix").
Usage
dtm_builder(
data,
text,
doc_id = NULL,
vocab = NULL,
chunk = NULL,
dense = FALSE,
omit_empty = FALSE
)
Arguments
data |
Data.frame with column of texts and column of document ids |
text |
Name of the column with documents' text |
doc_id |
Name of the column with documents' unique ids. |
vocab |
Default is |
chunk |
Default is |
dense |
The default ( |
omit_empty |
Logical (default = |
Details
The function is fast because it has few bells and whistles:
No weighting schemes other than raw counts
Tokenizes by the fixed, single whitespace
Only tokenizes unigrams. No bigrams, trigrams, etc...
Columns are in the order unique terms are discovered
No preprocessing during building
Outputs a basic sparse Matrix or dense matrix
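To make the points above concrete, here is a minimal base R sketch of a unigram DTM built by splitting on a fixed single space, with columns in the order terms are first seen. It is an illustration under stated assumptions, not the code inside dtm_builder(); the object names (texts, tokens, vocab) and the toy documents are hypothetical.

```r
library(Matrix)

# Hypothetical toy documents, already lowercased and without punctuation
texts <- c(doc1 = "what a wonderful world", doc2 = "a wonderful wonderful day")

# Split on a fixed single space: no regular expressions, no preprocessing
tokens <- strsplit(texts, " ", fixed = TRUE)
flat <- unlist(tokens, use.names = FALSE)

# Vocabulary in the order unique terms are first encountered
vocab <- unique(flat)

# Row index = document, column index = position of the term in vocab
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(flat, vocab)

# sparseMatrix() sums duplicated (i, j) pairs, which yields raw counts
dtm <- sparseMatrix(
  i = i, j = j, x = rep(1, length(i)),
  dims = c(length(texts), length(vocab)),
  dimnames = list(names(texts), vocab)
)
```

The result is a "dgCMatrix" of raw counts, with no weighting, no n-grams, and no reordering of columns.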
Weighting or stopping terms can be done efficiently after the fact with
simple matrix operations, rather than implicitly within the function
itself, for example by using the dtm_stopper() function.
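As a rough illustration of what "simple matrix operations" can look like, the sketch below removes stop terms by column subsetting and then converts counts to relative frequencies. It uses only the Matrix package and a hand-built toy DTM standing in for the output of dtm_builder(); it does not call dtm_stopper(), and the names stop_terms, dtm_stopped, and dtm_relfreq are hypothetical.

```r
library(Matrix)

# A toy DTM standing in for the output of dtm_builder() (hypothetical data)
dtm <- sparseMatrix(
  i = c(1, 1, 1, 2, 2, 3),
  j = c(1, 2, 3, 1, 4, 1),
  x = c(2, 1, 1, 1, 3, 1),
  dimnames = list(paste0("doc", 1:3), c("the", "wonderful", "world", "think"))
)

# Stopping: drop columns whose names appear in a stop list
stop_terms <- c("the")
dtm_stopped <- dtm[, !colnames(dtm) %in% stop_terms, drop = FALSE]

# Weighting: relative frequencies, dividing each row by its total count
row_totals <- rowSums(dtm_stopped)
row_totals[row_totals == 0] <- 1 # documents left empty stay all zeros
dtm_relfreq <- Diagonal(x = 1 / row_totals) %*% dtm_stopped

# The diagonal matrix carries no row names, so restore them
rownames(dtm_relfreq) <- rownames(dtm_stopped)
```

Both steps operate on the sparse matrix directly, so the DTM never has to be converted to a dense matrix.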
Prior to creating the DTM, texts should have whitespace trimmed and, if
desired, punctuation removed and terms lowercased.
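A minimal base R sketch of this kind of cleaning follows. The input string is hypothetical. Collapsing runs of whitespace is an added assumption: with a fixed single-space split, as in base R's strsplit(), consecutive spaces would otherwise yield empty tokens.

```r
# Hypothetical raw text with stray spaces, punctuation, and capitals
raw <- c("  They'll learn much   more, than I'll ever know!  ")

clean <- tolower(raw)                   # lowercase terms
clean <- gsub("[[:punct:]]", "", clean) # remove punctuation
clean <- gsub("\\s+", " ", clean)       # collapse runs of whitespace
clean <- trimws(clean)                  # trim leading/trailing whitespace
```

The cleaned vector can then be stored as a column of the data.frame passed to dtm_builder().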
Like tidytext's DTM functions, dtm_builder() is optimized for use in a
pipeline, but unlike tidytext, it does not build an intermediary
triplet list, so dtm_builder() is faster and far more memory efficient.
The function can also chunk the corpus into documents of a given length
(default is NULL). If the integer provided is 200L, this will divide
the corpus into new documents of 200 terms each (with the final document
likely containing fewer than 200). If the total number of terms in the
corpus is less than or equal to the chunk integer, this produces a DTM
with a single document (which most users will probably not want).
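The chunking logic described above can be sketched in base R as follows. This assumes the corpus is treated as one running sequence of tokens and cut every chunk terms; how dtm_builder() handles this internally is not shown in this page, and the names corpus, chunk_id, and chunks are hypothetical.

```r
# Hypothetical two-line corpus, 9 tokens in total
corpus <- c("and i think to myself", "what a wonderful world")
chunk <- 4L

# Treat the whole corpus as one sequence of tokens
tokens <- unlist(strsplit(corpus, " ", fixed = TRUE))

# Assign every token to a chunk of (at most) `chunk` terms
chunk_id <- ceiling(seq_along(tokens) / chunk)
chunks <- split(tokens, chunk_id)

# Number of terms per new document; the last one holds the remainder
chunk_sizes <- lengths(chunks)
```

With 9 tokens and a chunk size of 4, this gives three new documents, the last of which is shorter than the others. Setting chunk to 9 or more would leave a single document.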
If the vocabulary is already known, or standardizing vocabulary across
several DTMs is desired, a list of terms can be provided to the vocab
argument. Columns of the DTM will be in the order of the list of terms.
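The effect of a fixed vocabulary on column order can be sketched with base R and the Matrix package. This is an illustration only: it assumes that terms outside the vocabulary are dropped and that vocabulary terms absent from the texts remain as all-zero columns, which this page does not state explicitly. The toy texts and object names are hypothetical.

```r
library(Matrix)

# Known vocabulary; "haiku" does not occur in the texts below
vocab <- c("wonderful", "world", "haiku", "think")
texts <- c(line1 = "and i think to myself", line2 = "what a wonderful world")

tokens <- strsplit(texts, " ", fixed = TRUE)
i <- rep(seq_along(tokens), lengths(tokens))
j <- match(unlist(tokens, use.names = FALSE), vocab)

# Assumption: tokens that are not in the vocabulary are discarded
keep <- !is.na(j)

# Columns follow the order of `vocab`, not the order of discovery
dtm <- sparseMatrix(
  i = i[keep], j = j[keep], x = rep(1, sum(keep)),
  dims = c(length(texts), length(vocab)),
  dimnames = list(names(texts), vocab)
)
```

Because every DTM built this way has the same columns in the same order, several of them can be stacked with rbind() or compared column by column.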
Value
Returns a document-term matrix of class "dgCMatrix" (sparse) or of class "matrix" (dense).
Author(s)
Dustin Stoltz
Examples
library(dplyr)
my_corpus <- data.frame(
  text = c(
    "I hear babies crying I watch them grow",
    "They'll learn much more than I'll ever know",
    "And I think to myself",
    "What a wonderful world",
    "Yes I think to myself",
    "What a wonderful world"
  ),
  line_id = paste0("line", seq_len(6))
)
## some text preprocessing
my_corpus$clean_text <- tolower(gsub("'", "", my_corpus$text))
# example 1 with R 4.1 pipe
dtm <- my_corpus |>
  dtm_builder(clean_text, line_id)
# example 2 without pipe
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id
)
# example 3 with dplyr pipe and mutate
dtm <- my_corpus %>%
  mutate(
    clean_text = gsub("'", "", text),
    clean_text = tolower(clean_text)
  ) %>%
  dtm_builder(clean_text, line_id)
# example 4 with dplyr and chunk of 3 terms
dtm <- my_corpus %>%
  dtm_builder(clean_text,
    line_id,
    chunk = 3L
  )
# example 5 with user defined vocabulary
my.vocab <- c("wonderful", "world", "haiku", "think")
dtm <- dtm_builder(
  data = my_corpus,
  text = clean_text,
  doc_id = line_id,
  vocab = my.vocab
)