term_matrix {corpus}  R Documentation 
Tokenize a set of texts and compute a term frequency matrix.
term_matrix(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL,
            transpose = FALSE, ...)

term_counts(x, filter = NULL, ngrams = NULL, select = NULL, group = NULL, ...)
x 
a text vector to tokenize. 
filter 
if non-NULL, a text filter to use for tokenizing x.
ngrams 
an integer vector of n-gram lengths to include, or NULL to take the lengths from select (unigrams only if select is also NULL).
select 
a character vector of terms to count, or NULL to count all terms that appear in x.
group 
if non-NULL, a grouping variable with one value for each text in x; it is converted to a factor.
transpose 
a logical value indicating whether to transpose the result, putting terms as rows instead of columns. 
... 
additional properties to set on the text filter. 
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single-type terms) are included.
If group is NULL, then the output has one set of term counts for each input text. Otherwise, group is converted to a factor and one set of term counts is computed for each level. Texts with NA values for group are skipped.
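The grouping rule can be sketched as follows. This is a minimal illustration that assumes the corpus package is installed; the variable names text, group, m, and g are chosen for the example only.

```r
library(corpus)

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")

# The third text has an NA group, so it should be left out of the counts
group <- c("Good", "Bad", NA)

m <- term_matrix(text)                 # one row per input text
g <- term_matrix(text, group = group)  # one row per level of factor(group)

nrow(m)
nrow(g)
rownames(g)
```

Comparing nrow(m) with nrow(g) shows the change from per-text to per-level counts.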
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
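A quick way to check the shape of the two return values is sketched below. It assumes the corpus package is installed; text, m, and tm are illustrative names.

```r
library(corpus)

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!")

m  <- term_matrix(text)                    # texts in rows, terms in columns
tm <- term_matrix(text, transpose = TRUE)  # terms in rows, texts in columns

class(m)   # sparse matrix class from the Matrix package
dim(m)
dim(tm)    # the same two numbers as dim(m), in reverse order
```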
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
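The factor-to-index conversion can be illustrated by rebuilding a sparse matrix from a data frame of this shape. The sketch below assumes only the Matrix package. The data frame counts is a hypothetical, hand-made stand-in with the documented column names, not real term_counts output, and ordering the columns by the levels of "term" is an assumption that need not match the column order of term_matrix.

```r
library(Matrix)

# Hypothetical stand-in for a term_counts() result: same column names and
# types as described above, values made up for illustration
counts <- data.frame(
    text  = factor(c("a", "b", "b"), levels = c("a", "b")),
    term  = factor(c("rose", "rose", "red"), levels = c("rose", "red")),
    count = c(3, 1, 1)
)

# as.integer() on a factor returns the position of each value's level
i <- as.integer(counts$text)  # row index (text)
j <- as.integer(counts$term)  # column index (term, ordered by factor level)

m <- sparseMatrix(i = i, j = j, x = counts$count,
                  dims = c(nlevels(counts$text), nlevels(counts$term)),
                  dimnames = list(levels(counts$text), levels(counts$term)))
m
```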
term_counts with a non-NULL group behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)

# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))

# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))

# include higher-order n-grams
term_matrix(text, ngrams = 1:3)

# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))

# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows

# data frame
head(term_counts(text), n = 10) # first 10 rows

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))

# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))