as.TermDocumentMatrix {polmineR} | R Documentation
Generate TermDocumentMatrix / DocumentTermMatrix.
Description
Methods to generate the classes TermDocumentMatrix or DocumentTermMatrix as defined in the tm package. There are many text mining applications for document-term matrices. A DocumentTermMatrix is required as input by the topicmodels package, for instance.
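As an illustration of the topicmodels use case, the following sketch builds a DocumentTermMatrix from the REUTERS sample corpus and passes it to topicmodels::LDA(). It assumes that the polmineR, RcppCWB and topicmodels packages are installed. The number of topics (k = 5) and the seed are arbitrary choices made for this example, not recommendations.

```r
library(polmineR)
library(topicmodels)

# Make the REUTERS sample corpus shipped with RcppCWB available
use(pkg = "RcppCWB", corpus = "REUTERS")

# One row per document (s-attribute "id"), one column per token ("word")
dtm <- as.DocumentTermMatrix(
  "REUTERS",
  p_attribute = "word",
  s_attribute = "id",
  verbose = FALSE
)

# Fit a small topic model; k and seed are illustrative assumptions
lda <- LDA(dtm, k = 5, control = list(seed = 42))

# Inspect the five most probable terms per topic
top_terms <- terms(lda, 5)
```

In practice, stopwords and rare tokens would usually be removed from the matrix before a model is fitted; that step is omitted here to keep the sketch short.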
Usage
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)
## S4 method for signature 'character'
as.TermDocumentMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
## S4 method for signature 'corpus'
as.DocumentTermMatrix(
  x,
  p_attribute,
  s_attribute,
  stoplist = NULL,
  binarize = FALSE,
  verbose = TRUE,
  ...
)
## S4 method for signature 'character'
as.DocumentTermMatrix(x, p_attribute, s_attribute, verbose = TRUE, ...)
## S4 method for signature 'bundle'
as.TermDocumentMatrix(x, col, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'bundle'
as.DocumentTermMatrix(x, col = NULL, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'partition_bundle'
as.DocumentTermMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
## S4 method for signature 'partition_bundle'
as.TermDocumentMatrix(x, p_attribute = NULL, col = NULL, verbose = TRUE, ...)
## S4 method for signature 'subcorpus_bundle'
as.TermDocumentMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'subcorpus_bundle'
as.DocumentTermMatrix(x, p_attribute = NULL, verbose = TRUE, ...)
## S4 method for signature 'context'
as.DocumentTermMatrix(x, p_attribute, verbose = TRUE, ...)
## S4 method for signature 'context'
as.TermDocumentMatrix(x, p_attribute, verbose = TRUE, ...)
Arguments
x |
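A length-one character vector naming a corpus, a corpus object, a bundle or an object of a class inheriting from it (such as partition_bundle or subcorpus_bundle), or a context object. |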
... |
Definitions of s-attribute used for subsetting the corpus, compare partition-method. |
p_attribute |
The p-attribute that counting is based on. |
s_attribute |
An s-attribute that defines the content of the columns (TermDocumentMatrix) or rows (DocumentTermMatrix). |
verbose |
A logical value, whether to output progress messages. |
stoplist |
A |
binarize |
A logical value. |
col |
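The column of the data.table in the stat slot of the objects in the bundle that supplies the values of the matrix. |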
Details
If x refers to a corpus (i.e. is a length 1 character vector), a TermDocumentMatrix or DocumentTermMatrix will be generated for subsets of the corpus based on the s_attribute provided. Counts are performed for the p_attribute. Further parameters (passed in as ...) are interpreted as s-attributes that define a subset of the corpus, which is then split according to s_attribute. If the struc values for s_attribute are not unique, the necessary aggregation is performed, which slows things down somewhat.
If x is a bundle or an object of a class inheriting from it, the counts or whatever measure is present in the stat slots (in the column indicated by col) will be turned into the values of the sparse matrix that is generated. A special case is the generation of the sparse matrix based on a partition_bundle that does not yet include counts. In this case, a p_attribute needs to be provided, and counting will be performed as well.
If x is a partition_bundle and argument col is not NULL, a TermDocumentMatrix is generated based on the column indicated by col of the data.table with counts in the stat slots of the objects in the bundle. If col is NULL, the p-attribute indicated by p_attribute is decoded, and a count is performed to obtain the values of the resulting TermDocumentMatrix. The same procedure applies to get a DocumentTermMatrix.
If x is a subcorpus_bundle, the p-attribute provided by argument p_attribute is decoded, and a count is performed to obtain the resulting TermDocumentMatrix or DocumentTermMatrix.
Value
A TermDocumentMatrix or a DocumentTermMatrix object. These classes are defined in the tm package and inherit from the simple_triplet_matrix class defined in the slam package.
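Because the returned objects inherit from simple_triplet_matrix, functions from slam can be applied to them directly. The sketch below illustrates this with a hand-made toy matrix rather than polmineR output. The terms, document names and counts are invented for the example, and the tm and slam packages are assumed to be installed.

```r
library(slam)
library(tm)

# Toy counts for 3 terms in 2 documents, stored as (row, column, value) triplets
stm <- simple_triplet_matrix(
  i = c(1L, 2L, 2L, 3L),   # term index
  j = c(1L, 1L, 2L, 2L),   # document index
  v = c(2, 1, 4, 3),       # counts
  dimnames = list(
    Terms = c("oil", "price", "trade"),
    Docs = c("doc1", "doc2")
  )
)

# Turn the sparse matrix into a TermDocumentMatrix with plain term frequencies
tdm <- tm::as.TermDocumentMatrix(stm, weighting = weightTf)

# slam functions work on the object because of the inheritance
term_totals <- row_sums(tdm)

# Transposing a TermDocumentMatrix yields a DocumentTermMatrix
dtm <- t(tdm)
```

The same calls (row_sums(), t(), and so on) should apply to matrices returned by the methods documented here, since they share the class structure.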
Author(s)
Andreas Blaette
Examples
# examples not run by default to save time on CRAN test machines
library(polmineR)
use(pkg = "RcppCWB", corpus = "REUTERS")

# enriching partition_bundle explicitly
tdm <- corpus("REUTERS") %>%
  partition_bundle(s_attribute = "id") %>%
  enrich(p_attribute = "word") %>%
  as.TermDocumentMatrix(col = "count")

# leave the counting to the as.TermDocumentMatrix-method
tdm <- partition_bundle("REUTERS", s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word", verbose = FALSE)

# obtain TermDocumentMatrix directly (fastest option)
tdm <- as.TermDocumentMatrix(
  "REUTERS",
  p_attribute = "word",
  s_attribute = "id",
  verbose = FALSE
)

# workflow using split(), which yields a subcorpus_bundle
tdm <- corpus("REUTERS") %>%
  split(s_attribute = "id") %>%
  as.TermDocumentMatrix(p_attribute = "word")