| vectorize.docs {fdm2id} | R Documentation | 
Document vectorization
Description
Vectorize a corpus of documents.
Usage
vectorize.docs(
  vectorizer = NULL,
  corpus = NULL,
  lang = "en",
  stopwords = lang,
  ngram = 1,
  mincount = 10,
  minphrasecount = NULL,
  transform = c("tfidf", "lsa", "l1", "none"),
  latentdim = 50,
  returndata = TRUE,
  ...
)
Arguments
| vectorizer | The document vectorizer (for instance, one returned by a previous call with returndata = FALSE). | 
| corpus | The corpus of documents (a character vector). | 
| lang | The language of the documents, used for stemming (NULL to disable stemming). | 
| stopwords | Stopwords, or the language of the documents. NULL if stopwords should not be removed. | 
| ngram | Maximum size of n-grams. | 
| mincount | Minimum count for a word to be considered frequent. | 
| minphrasecount | Minimum count for a collocation of words to be considered frequent. | 
| transform | Transformation (TF-IDF, LSA, L1 normalization, or nothing). | 
| latentdim | Number of latent dimensions if LSA transformation is performed. | 
| returndata | If TRUE, the vectorized documents are returned. If FALSE, a "vectorizer" is returned. | 
| ... | Other parameters. | 
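To give an intuition for what the "l1" and "tfidf" values of transform compute, here is a minimal base R sketch on a toy corpus. It is not the package's implementation: the corpus, the whitespace tokenization, and the unsmoothed IDF formula are assumptions chosen for illustration, and the exact weighting used by vectorize.docs may differ.

```r
# Toy corpus (hypothetical data, for illustration only)
corpus <- c("the cat sat", "the dog sat", "the cat ran fast")

# Tokenize on whitespace and build a sorted vocabulary
tokens <- strsplit(tolower(corpus), "\\s+")
vocab <- sort(unique(unlist(tokens)))

# Document-term matrix of raw counts (rows = documents, columns = terms)
dtm <- t(vapply(tokens,
                function(tok) as.numeric(table(factor(tok, levels = vocab))),
                numeric(length(vocab))))
colnames(dtm) <- vocab

# L1 normalization: each row is divided by its total count, so rows sum to 1
l1 <- dtm / rowSums(dtm)

# TF-IDF (one common variant, assumed here): term frequency times
# log(number of documents / number of documents containing the term)
idf <- log(nrow(dtm) / colSums(dtm > 0))
tfidf <- sweep(l1, 2, idf, "*")
```

In this variant, a term that occurs in every document (here "the") gets a weight of zero, which is why TF-IDF tends to downweight very common words even when stopwords are not removed.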
Value
The vectorized documents, or a "vectorizer" if returndata = FALSE.
See Also
query.docs, stopwords, vectorizers
Examples
## Not run: 
require (text2vec)
data ("movie_review")
# Clustering: vectorize the reviews with TF-IDF, then cluster a random sample of 100 documents
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)
# Classification: build the vectorizer on the training set, then apply it to both the training and test sets
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
vectorizer = vectorize.docs (corpus = d$train.x,
                             returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)
## End(Not run)
[Package fdm2id version 0.9.9 Index]