seq_builder {text2map} | R Documentation |
Represent Documents as Token-Integer Sequences
Description
First, each token in the vocabulary is mapped to an integer in a lookup dictionary. Next, each document is converted to a sequence of integers, where each integer is the dictionary index of the corresponding token.
Usage
seq_builder(
  data,
  text,
  doc_id = NULL,
  vocab = NULL,
  maxlen = NULL,
  matrix = TRUE
)
Arguments
data | Data.frame with column of texts and column of document ids |
text | Name of the column with documents' text |
doc_id | Name of the column with documents' unique ids. |
vocab | Default is |
maxlen | Integer indicating the maximum document length. If NULL (default), the length of the longest document is used. |
matrix | Logical, |
Details
By default, the function returns a matrix of integer sequences. The number of columns equals the length of the longest document or maxlen, with shorter documents padded with zeros. The dictionary is stored as an attribute of the matrix and can be accessed with attr(seq, "dic"). If matrix = FALSE, the function returns a list of integer sequences. The vocabulary is either every unique token in the corpus or the list of words provided to the vocab argument. This kind of text representation is used in tensorflow and keras.
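The steps described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the toy corpus, whitespace tokenization, indexing tokens by order of first appearance, and padding on the right are all assumptions made for this example, and the names docs, dic, seqs, and seq_mat are hypothetical.

```r
# Toy corpus; splitting on single spaces is an assumption of this sketch.
docs <- c(d1 = "the cat sat", d2 = "the dog sat down today", d3 = "cat")
tokens <- strsplit(docs, " ", fixed = TRUE)

# Step 1: build a lookup dictionary mapping each unique token to an integer.
# Indexing by order of first appearance is an assumption.
vocab <- unique(unlist(tokens, use.names = FALSE))
dic <- setNames(seq_along(vocab), vocab)

# Step 2: convert each document to a sequence of dictionary indices.
seqs <- lapply(tokens, function(tk) unname(dic[tk]))

# Step 3: pad every sequence with zeros up to the longest document.
# Padding on the right is an assumption.
maxlen <- max(lengths(seqs))
pad <- function(s) c(s, rep(0L, maxlen - length(s)))
seq_mat <- t(vapply(seqs, pad, integer(maxlen)))

# Store the dictionary as an attribute of the matrix.
attr(seq_mat, "dic") <- dic
```

In this sketch, seqs corresponds to the list form (matrix = FALSE) and seq_mat to the matrix form, with one row per document and the dictionary retrievable through attr(seq_mat, "dic").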
Value
Returns a matrix of integer sequences or, if matrix = FALSE, a list of integer sequences.
Author(s)
Dustin Stoltz