| seq_builder {text2map} | R Documentation |
Represent Documents as Token-Integer Sequences
Description
Converts documents to sequences of integers. First, each token in the vocabulary is mapped to an integer in a lookup dictionary. Next, each document is converted to a sequence of integers, where each integer is the dictionary index of the corresponding token.
Usage
seq_builder(
data,
text,
doc_id = NULL,
vocab = NULL,
maxlen = NULL,
matrix = TRUE
)
Arguments
data |
Data.frame with a column of texts and a column of document ids |
text |
Name of the column with documents' text |
doc_id |
Name of the column with documents' unique ids. |
vocab |
Default is |
maxlen |
Integer indicating the maximum document length. If NULL (default), the length of the longest document is used. |
matrix |
Logical, |
Details
By default, the function returns a matrix of integer sequences,
with one row per document. The number of columns equals maxlen
if it is supplied, and otherwise the length of the longest
document; shorter documents are padded with zeros. The
dictionary is stored as an attribute of the matrix and can be
accessed with attr(seq, "dic"). If matrix = FALSE, the function
returns a list of integer sequences instead. The vocabulary is
either every unique token in the corpus or the list of words
provided to the vocab argument. This kind of text
representation is used in TensorFlow
and Keras.
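The representation described above can be sketched in base R. This is a minimal illustration of the technique, not the package's implementation: the toy data frame, the whitespace tokenizer, the order in which integers are assigned, and the choice to pad at the end of each row are all assumptions made for the example.

```r
# Toy corpus; the data and column names are illustrative only.
docs <- data.frame(
  doc_id = c("d1", "d2", "d3"),
  text = c("the cat sat", "the dog", "a cat and a dog"),
  stringsAsFactors = FALSE
)

# Tokenize on single spaces (assumption: the package may tokenize differently).
tokens <- strsplit(docs$text, " ", fixed = TRUE)

# Step 1: build the lookup dictionary, one integer per unique token.
vocab <- unique(unlist(tokens))
dic <- setNames(seq_along(vocab), vocab)

# Step 2: replace each token with its index in the dictionary.
seqs <- lapply(tokens, function(x) unname(dic[x]))

# Step 3: pad shorter documents with zeros up to the longest document.
# Padding at the end of each row is an assumption of this sketch.
maxlen <- max(lengths(seqs))
seq_mat <- t(vapply(
  seqs,
  function(x) c(x, integer(maxlen - length(x))),
  integer(maxlen)
))
rownames(seq_mat) <- docs$doc_id

# Keep the dictionary alongside the matrix as an attribute.
attr(seq_mat, "dic") <- dic
```

Here `seqs` corresponds to the list form and `seq_mat` to the padded matrix form, with the dictionary retrievable through `attr(seq_mat, "dic")`.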
Value
Returns a matrix of integer sequences or, if matrix = FALSE, a list of integer sequences.
Author(s)
Dustin Stoltz