vectorize.docs {fdm2id} | R Documentation |
Document vectorization
Description
Vectorize a corpus of documents.
Usage
vectorize.docs(
vectorizer = NULL,
corpus = NULL,
lang = "en",
stopwords = lang,
ngram = 1,
mincount = 10,
minphrasecount = NULL,
transform = c("tfidf", "lsa", "l1", "none"),
latentdim = 50,
returndata = TRUE,
...
)
Arguments
vectorizer: The document vectorizer (as returned by a previous call with returndata = FALSE), or NULL to create a new one.
corpus: The corpus of documents (a character vector).
lang: The language of the documents (NULL if no stemming should be performed).
stopwords: The stop words, or the language of the documents. NULL if stop words should not be removed.
ngram: Maximum size of the n-grams.
mincount: Minimum number of occurrences for a word to be considered frequent.
minphrasecount: Minimum number of occurrences for a collocation of words to be considered frequent.
transform: The transformation to apply ("tfidf" for TF-IDF, "lsa" for LSA, "l1" for L1 normalization, or "none").
latentdim: Number of latent dimensions if the LSA transformation is performed.
returndata: If TRUE, the vectorized documents are returned. If FALSE, a "vectorizer" is returned.
...: Other parameters.
Value
The vectorized documents (if returndata = TRUE), or a document vectorizer (if returndata = FALSE).
See Also
query.docs, stopwords, vectorizers
Examples
## Not run:
require (text2vec)
data ("movie_review")
# Clustering
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)
# Classification
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
vectorizer = vectorize.docs (corpus = d$train.x,
returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)
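# A sketch of the LSA transformation (hypothetical settings, not part of
# the original example): project the documents onto 20 latent dimensions
# instead of keeping the full TF-IDF space.
docs.lsa = vectorize.docs (corpus = movie_review$review,
                           transform = "lsa", latentdim = 20)
# The reduced matrix can then be clustered as above, e.g.:
# km.lsa = KMEANS (docs.lsa [sample (nrow (docs.lsa), 100), ], k = 10)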
## End(Not run)
[Package fdm2id version 0.9.9 Index]