docTermMatrix,kRp.corpus-method {tm.plugin.koRpus}    R Documentation

Generate a document-term matrix from a corpus object

Description

Calculates a sparse document-term matrix from a given object of class kRp.corpus and, by default, adds it to the object's feature list. Instead of absolute frequencies, you can also calculate the term frequency-inverse document frequency (tf-idf) value for each term.

Usage

## S4 method for signature 'kRp.corpus'
docTermMatrix(
  obj,
  terms = "token",
  case.sens = FALSE,
  tfidf = FALSE,
  as.feature = TRUE
)

Arguments

obj

An object of class kRp.corpus.

terms

A character string defining the tokens column to be used for calculating the matrix.

case.sens

Logical, whether terms should be counted case-sensitively.

tfidf

Logical, if TRUE calculates term frequency–inverse document frequency (tf-idf) values instead of absolute frequency.

as.feature

Logical, if TRUE returns the input object with the sparse matrix added as a feature, if FALSE returns just the sparse matrix. Use corpusDocTermMatrix to get the matrix from such an aggregated object.
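
To illustrate what the tfidf argument changes, here is a minimal sketch of one common tf-idf weighting, computed by hand on a toy matrix. It is not taken from the package: the exact formula docTermMatrix uses is not stated on this page, so the raw-count term frequency and the log(N / df) inverse document frequency are assumptions, as are the toy data and all object names.

```r
library(Matrix)

# toy document-term counts (made-up data): 3 documents (rows) x 4 terms (columns)
counts <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(1, 2, 1, 3, 1, 4),
  x = c(2, 1, 1, 3, 1, 2),
  dims = c(3, 4),
  dimnames = list(
    c("doc1", "doc2", "doc3"),
    c("the", "cat", "dog", "bird")
  )
)

# document frequency: in how many documents does each term occur?
docFreq <- colSums(counts > 0)

# inverse document frequency, ASSUMED here to be log(N / df)
idf <- log(nrow(counts) / docFreq)

# scale each term column by its idf; Diagonal() keeps the product sparse
tfidf <- counts %*% Diagonal(x = unname(idf))
dimnames(tfidf) <- dimnames(counts)
```

Under this weighting a term that occurs in every document (here "the") gets a weight of zero, while terms concentrated in a single document keep a high value.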

Details

The settings of terms, case.sens, and tfidf will be stored in the object's meta slot, so you can use corpusMeta(..., "doc_term_matrix") to fetch them.

See the examples to learn how to limit the analysis to desired word classes.
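
The following sketch shows how the stored matrix and its settings might be retrieved when as.feature is left at TRUE. It reuses the readCorpus() call from the examples and only uses functions named on this page. The object names are illustrative, and the structure of the value returned by corpusMeta() is not described here, so it is only assigned, not inspected.

```r
# guarded so nothing runs unless both packages can be loaded
if(
  require("tm.plugin.koRpus", quietly = TRUE) &&
  require("koRpus.lang.en", quietly = TRUE)
){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    tagger="tokenize",
    lang="en"
  )

  # with as.feature=TRUE (the default) the corpus object itself is returned,
  # now carrying the matrix as a feature
  myCorpus <- docTermMatrix(myCorpus, tfidf=TRUE)

  # pull the sparse matrix back out of the aggregated object
  myDTMatrix <- corpusDocTermMatrix(myCorpus)

  # fetch the stored settings of terms, case.sens, and tfidf
  dtmSettings <- corpusMeta(myCorpus, "doc_term_matrix")
} else {}
```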

Value

Either an object of the input class (if as.feature=TRUE) or a sparse matrix of class dgCMatrix (if as.feature=FALSE).
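
As a rough illustration of working with a dgCMatrix, the sketch below builds a small stand-in matrix with the Matrix package and inspects it. The data and object names are made up, and the layout with documents in rows and terms in columns is an assumption based on the usual document-term convention, not something this page states.

```r
library(Matrix)

# hypothetical stand-in for a matrix returned with as.feature=FALSE
dtm <- sparseMatrix(
  i = c(1, 1, 2, 2, 3, 3),
  j = c(1, 2, 1, 3, 1, 4),
  x = c(2, 1, 1, 3, 1, 2),
  dims = c(3, 4),
  dimnames = list(
    c("doc1", "doc2", "doc3"),
    c("the", "cat", "dog", "bird")
  )
)

# number of documents and terms (under the assumed layout)
dim(dtm)

# total count per term, most frequent first
termTotals <- sort(colSums(dtm), decreasing = TRUE)

# convert only a small slice to a dense base R matrix;
# converting a full-size matrix can use a lot of memory
dense <- as.matrix(dtm[, names(termTotals)[1:2]])
```

Keeping the matrix sparse for as long as possible and densifying only the columns of interest is usually the safer choice for larger corpora.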

Examples

# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )

  # get the document-term frequencies in a sparse matrix
  myDTMatrix <- docTermMatrix(myCorpus, as.feature=FALSE)

  # combine with filterByClass() to, e.g., exclude all punctuation
  myDTMatrix <- docTermMatrix(filterByClass(myCorpus), as.feature=FALSE)

  # instead of absolute frequencies, get the tf-idf values
  myDTMatrix <- docTermMatrix(
    filterByClass(myCorpus),
    tfidf=TRUE,
    as.feature=FALSE
  )
} else {}

[Package tm.plugin.koRpus version 0.4-2 Index]