R: Document vectorization

vectorize.docs {fdm2id}

R Documentation

Document vectorization

Description

Vectorize a corpus of documents.

Usage

vectorize.docs(
  vectorizer = NULL,
  corpus = NULL,
  lang = "en",
  stopwords = lang,
  ngram = 1,
  mincount = 10,
  minphrasecount = NULL,
  transform = c("tfidf", "lsa", "l1", "none"),
  latentdim = 50,
  returndata = TRUE,
  ...
)

Arguments

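`vectorizer`	The document vectorizer, as returned by a previous call with `returndata = FALSE`. If NULL, a new vectorizer is built from `corpus`.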
`corpus`	The corpus of documents (a vector of characters).
`lang`	The language of the documents (NULL if no stemming).
`stopwords`	Stopwords, or the language of the documents. NULL if stop words should not be removed.
`ngram`	Maximum size of n-grams.
`mincount`	Minimum word count to be considered as frequent.
`minphrasecount`	Minimum count for a collocation of words (a phrase) to be considered frequent.
`transform`	Transformation (TF-IDF, LSA, L1 normalization, or nothing).
`latentdim`	Number of latent dimensions if LSA transformation is performed.
`returndata`	If TRUE, the vectorized documents are returned. If FALSE, a "vectorizer" is returned instead, which can be passed back through the `vectorizer` argument.
`...`	Other parameters.
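
The `transform` options can be illustrated on a toy document-term matrix. The following base R code is a minimal sketch, not the package's implementation: the counts are made up, the TF-IDF formula is one common variant chosen for illustration, and the weighting and normalization actually applied by `vectorize.docs` may differ.

```r
# Toy document-term count matrix: 3 documents (rows) x 4 terms (columns).
# The values are invented for illustration.
counts <- matrix(c(2, 1, 0, 0,
                   0, 1, 3, 0,
                   1, 1, 0, 2),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(paste0("doc", 1:3),
                                 c("film", "good", "plot", "actor")))

# "l1": divide each row by its sum, so every document sums to 1.
l1 <- counts / rowSums(counts)

# "tfidf" (assumed variant): L1-normalized term frequency multiplied by
# log(number of documents / number of documents containing the term).
docfreq <- colSums(counts > 0)
idf <- log(nrow(counts) / docfreq)
tfidf <- sweep(l1, 2, idf, `*`)

# "lsa": keep the first k singular directions of the TF-IDF matrix,
# where k plays the role of `latentdim`.
k <- 2
s <- svd(tfidf, nu = k, nv = k)
lsa <- s$u %*% diag(s$d[1:k], nrow = k)
```

In this sketch the term "good" occurs in every document, so its inverse document frequency is log(1) = 0 and its TF-IDF column is zero. The LSA result has one row per document and `k` columns.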

Value

The vectorized documents, or a vectorizer if `returndata = FALSE`.

Examples

## Not run: 
require (fdm2id)
require (text2vec) # Provides the movie_review data set
data ("movie_review")
# Clustering
docs = vectorize.docs (corpus = movie_review$review, transform = "tfidf")
km = KMEANS (docs [sample (nrow (docs), 100), ], k = 10)
# Classification
d = movie_review [, 2:3]
d [, 1] = factor (d [, 1])
d = splitdata (d, 1)
vectorizer = vectorize.docs (corpus = d$train.x,
                             returndata = FALSE, mincount = 50)
train = vectorize.docs (corpus = d$train.x, vectorizer = vectorizer)
test = vectorize.docs (corpus = d$test.x, vectorizer = vectorizer)
model = NB (as.matrix (train), d$train.y)
pred = predict (model, as.matrix (test))
evaluation (pred, d$test.y)

## End(Not run)
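
The classification example fits a vectorizer on the training texts and reuses it for the test texts, so that both matrices share the same columns. The base R code below is a minimal sketch of that idea only. The helpers `tokenize`, `fit_vocabulary` and `count_terms` are hypothetical names introduced for illustration and are not part of fdm2id, and the sketch ignores stemming, stop words and n-grams.

```r
# Hypothetical helper: lower-case the texts and split them on non-letters.
tokenize <- function(docs) {
  lapply(strsplit(tolower(docs), "[^a-z]+"), function(x) x[nzchar(x)])
}

# Hypothetical helper: keep the words seen at least `mincount` times.
fit_vocabulary <- function(docs, mincount = 1) {
  freq <- table(unlist(tokenize(docs)))
  names(freq)[freq >= mincount]
}

# Hypothetical helper: count vocabulary words in each document.
# Words outside the vocabulary are ignored.
count_terms <- function(docs, vocabulary) {
  rows <- lapply(tokenize(docs), function(tokens) {
    as.integer(table(factor(tokens, levels = vocabulary)))
  })
  m <- do.call(rbind, rows)
  colnames(m) <- vocabulary
  m
}

train_docs <- c("a good film", "a good plot", "bad film")
test_docs <- c("good good actor", "bad plot")

# The vocabulary is learned from the training texts only...
vocabulary <- fit_vocabulary(train_docs, mincount = 2)
# ...and applied unchanged to both sets, so the columns line up.
train_counts <- count_terms(train_docs, vocabulary)
test_counts <- count_terms(test_docs, vocabulary)
```

Words that appear only in the test texts (here "actor") get no column, which is why a model trained on `train_counts` can be applied to `test_counts` directly.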

[Package fdm2id version 0.9.9 Index]
