R: Convert a character vector to a term co-occurrence matrix.

CreateTcm {textmineR}

R Documentation

Convert a character vector to a term co-occurrence matrix.

Description

This is the main function for creating a term co-occurrence matrix (TCM) in textmineR. In most cases, all you need to do is import documents as a character vector in R and then run this function to get a TCM that is compatible with the rest of textmineR's functionality and many other libraries. CreateTcm is built on top of the text2vec library.

Usage

CreateTcm(
  doc_vec,
  skipgram_window = Inf,
  ngram_window = c(1, 1),
  stopword_vec = c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart")),
  lower = TRUE,
  remove_punctuation = TRUE,
  remove_numbers = TRUE,
  stem_lemma_function = NULL,
  verbose = FALSE,
  ...
)

Arguments

`doc_vec`	A character vector of documents.
`skipgram_window`	An integer window, from `0` to `Inf` for skip-grams. Defaults to `Inf`. See 'Details', below.
`ngram_window`	A numeric vector of length 2. The first entry is the minimum n-gram size; the second entry is the maximum n-gram size. Defaults to `c(1, 1)`. Must be `c(1, 1)` if `skipgram_window` is not `0` or `Inf`.
`stopword_vec`	A character vector of stopwords you would like to remove. Defaults to `c(stopwords::stopwords("en"), stopwords::stopwords(source = "smart"))`. If you do not want stopwords removed, specify `stopword_vec = c()`.
`lower`	Do you want all words coerced to lower case? Defaults to `TRUE`.
`remove_punctuation`	Do you want to convert all non-alphanumeric characters to spaces? Defaults to `TRUE`.
`remove_numbers`	Do you want to convert all numbers to spaces? Defaults to `TRUE`.
`stem_lemma_function`	A function that you would like to apply to the documents for stemming, lemmatization, or similar. See examples for usage.
`verbose`	Do you want to see status during vectorization? Defaults to `FALSE`.
`...`	Other arguments to be passed to `TmParallelApply`.

Details

skipgram_window controls how co-occurrence is counted. With a finite, positive value, the TCM counts the number of times that term j appears within skipgram_window places of term i. Inf and 0 are special cases. Setting skipgram_window to Inf counts the number of documents in which term j and term i occur together. Setting skipgram_window to 0 counts the number of terms shared by document j and document i. A TCM where skipgram_window is 0 is the only TCM that will be symmetric.
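The two special cases can be illustrated with ordinary matrix products on a toy corpus. The following is a minimal sketch in base R, not CreateTcm's actual implementation: the documents, the `incidence` matrix, and the other variable names are hypothetical, and stopword removal and other preprocessing are ignored.

```r
# Toy corpus: three already-tokenized documents (hypothetical data).
docs <- list(
  d1 = c("cat", "sat", "mat"),
  d2 = c("cat", "ate", "fish"),
  d3 = c("dog", "sat", "mat")
)

vocab <- sort(unique(unlist(docs)))

# Binary document-by-term incidence matrix:
# entry is 1 if the term occurs in the document, 0 otherwise.
incidence <- t(sapply(docs, function(tokens) as.integer(vocab %in% tokens)))
colnames(incidence) <- vocab

# Analogue of skipgram_window = Inf: entry (i, j) is the number of
# documents in which term i and term j both occur.
term_by_term <- crossprod(incidence)

# Analogue of skipgram_window = 0: entry (i, j) is the number of
# terms shared by document i and document j.
doc_by_doc <- tcrossprod(incidence)

term_by_term["mat", "sat"]  # 2: both terms occur in d1 and d3
doc_by_doc["d1", "d3"]      # 2: d1 and d3 share "sat" and "mat"
```

In this sketch the first product is indexed by terms and the second by documents, which matches the descriptions of the Inf and 0 cases above.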

Value

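A term co-occurrence matrix of class dgCMatrix. The rows and columns both index terms. The i, j entries represent the count of term j co-occurring with term i, as determined by skipgram_window. When skipgram_window is 0, the rows and columns index documents instead; see 'Details', above.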

Note

The following transformations are applied to stopword_vec as well as doc_vec: lower, remove_punctuation, and remove_numbers.

See stopwords for details on the default value of the stopword_vec argument.

Examples

## Not run: 
data(nih_sample)

# TCM of unigrams and bigrams
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = Inf, 
                 ngram_window = c(1, 2))

# TCM of unigrams and a skip-gram window of 3, applying Porter's word stemmer
tcm <- CreateTcm(doc_vec = nih_sample$ABSTRACT_TEXT,
                 skipgram_window = 3,
                 stem_lemma_function = function(x) SnowballC::wordStem(x, "porter"))

## End(Not run)

[Package textmineR version 3.0.5 Index]