build_dtm {R.temis} | R Documentation |
build_dtm
Description
Compute document-term matrix from a corpus.
Usage
build_dtm(
  corpus,
  sparsity = 1,
  dictionary = NULL,
  remove_stopwords = FALSE,
  tolower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  min_length = 2
)
Arguments
corpus |
A |
sparsity |
Value between 0 and 1 indicating the proportion of documents
with no occurrences of a term above which that term should be dropped. By default
all terms are kept (sparsity = 1). |
dictionary |
A vector of terms to which the matrix should be restricted.
By default (dictionary = NULL), no such restriction is applied and words are filtered only by the other arguments, such as min_length. |
remove_stopwords |
Whether to remove stopwords appearing in a language-specific list
(see |
tolower |
Whether to convert all text to lower case. |
remove_punctuation |
Whether to remove all punctuation from text before tokenizing terms. |
remove_numbers |
Whether to remove all numbers from text before tokenizing terms. |
min_length |
The minimal number of characters for a word to be retained. |
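The sparsity rule described above can be illustrated with a small base R sketch. It applies the rule as worded here (a term is dropped when the share of documents in which it never occurs exceeds the threshold) to made-up counts. It is not the package's implementation, and the names counts, empty_share and threshold are hypothetical.

```r
# Toy document-term counts: 4 documents (rows) x 3 terms (columns).
# matrix() fills column by column, so each group of 4 values is one term.
counts <- matrix(
  c(1, 0, 2, 1,   # "oil":    absent from 1 of 4 documents
    0, 0, 0, 3,   # "barrel": absent from 3 of 4 documents
    1, 1, 1, 1),  # "price":  present in every document
  nrow = 4,
  dimnames = list(paste0("doc", 1:4), c("oil", "barrel", "price"))
)

# Share of documents in which each term does not occur.
empty_share <- colMeans(counts == 0)

# Assumed threshold, playing the role of the sparsity argument.
threshold <- 0.5

# Keep the terms whose share of empty documents does not exceed the threshold.
kept <- counts[, empty_share <= threshold, drop = FALSE]
colnames(kept)
```

With a threshold of 1, no term can exceed the limit, which matches the default behaviour of keeping all terms.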
Value
A DocumentTermMatrix object.
Examples
file <- system.file("texts", "reut21578-factiva.xml", package="tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language="en")
build_dtm(corpus)
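A further sketch of a call that overrides some defaults, reusing the sample file from the example above. It assumes that R.temis and tm.plugin.factiva are installed; the argument values are arbitrary choices for illustration, not recommendations.

```r
library(R.temis)

# Same sample corpus as in the example above.
file <- system.file("texts", "reut21578-factiva.xml", package = "tm.plugin.factiva")
corpus <- import_corpus(file, "factiva", language = "en")

# Drop rare terms, remove English stopwords and ignore very short words.
# The values 0.9 and 3 are illustrative only.
dtm <- build_dtm(
  corpus,
  sparsity = 0.9,
  remove_stopwords = TRUE,
  min_length = 3
)

# Rows are documents and columns are the retained terms.
dim(dtm)
```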
[Package R.temis version 0.1.3 Index]