getvocab {fdm2id}    R Documentation
Extract words and phrases from a corpus
Description
Extract words and phrases from a corpus of documents.
Usage
getvocab(
corpus,
mincount = 5,
minphrasecount = NULL,
ngram = 1,
lang = "en",
stopwords = lang,
...
)
Arguments
corpus
The corpus of documents (a character vector).
mincount
Minimum number of occurrences for a word to be considered frequent.
minphrasecount
Minimum number of occurrences for a collocation (phrase) to be considered frequent.
ngram
Maximum size of n-grams.
lang
The language of the documents, used for stemming (NULL for no stemming).
stopwords
A vector of stop words, or the language of the documents. NULL if stop words should not be removed.
...
Other parameters.
Value
The vocabulary used in the corpus of documents.
See Also
plotzipf, stopwords, create_vocabulary
Examples
## Not run:
text = loadtext ("http://mattmahoney.net/dc/text8.zip")
vocab1 = getvocab (text) # With stemming
nrow (vocab1)
vocab2 = getvocab (text, lang = NULL) # Without stemming
nrow (vocab2)
## End(Not run)
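The following base R sketch illustrates what the mincount and stopwords arguments describe: counting words across a corpus, dropping stop words, and keeping only the words that occur often enough. It is not the fdm2id implementation. The helper count_frequent_words, its ASCII-only tokenization, and the toy corpus are assumptions made for illustration, and stemming and n-grams are left out.

```r
# Illustrative sketch only: base R, NOT the fdm2id implementation.
# `count_frequent_words` is a hypothetical helper introduced for this example.
count_frequent_words <- function(corpus, mincount = 5, stopwords = character(0)) {
  # Lowercase, then split each document on runs of non-letter characters
  # (assumption: ASCII text; accented letters would be treated as separators)
  tokens <- unlist(strsplit(tolower(corpus), "[^a-z]+"))
  # Drop empty tokens and stop words
  tokens <- tokens[nzchar(tokens) & !(tokens %in% stopwords)]
  # Count the occurrences of each distinct word
  counts <- table(tokens)
  # Keep only the words that occur at least `mincount` times
  frequent <- counts[counts >= mincount]
  data.frame(term = names(frequent),
             count = as.integer(frequent),
             stringsAsFactors = FALSE)
}

# Toy corpus: a character vector with one document per element
docs <- c("the cat sat on the mat",
          "the dog sat on the log",
          "a cat and a dog")
vocab <- count_frequent_words(docs, mincount = 2,
                              stopwords = c("the", "a", "on", "and"))
print(vocab)
```

With mincount = 2, only the words that appear at least twice after stop word removal remain in the result. Lowering mincount to 1 keeps every remaining word.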
[Package fdm2id version 0.9.9 Index]