R: Generate a document-term matrix from a corpus object

docTermMatrix,kRp.corpus-method {tm.plugin.koRpus}

R Documentation

Generate a document-term matrix from a corpus object

Description

Calculates a sparse document-term matrix from a given object of class kRp.corpus and adds it to the object's feature list. You can also calculate the term frequency-inverse document frequency (tf-idf) value for each term.

Usage

## S4 method for signature 'kRp.corpus'
docTermMatrix(
  obj,
  terms = "token",
  case.sens = FALSE,
  tfidf = FALSE,
  as.feature = TRUE
)

Arguments

`obj`	An object of class `kRp.corpus`.
`terms`	A character string defining the `tokens` column to be used for calculating the matrix.
`case.sens`	Logical, whether terms should be counted case-sensitively.
`tfidf`	Logical, if `TRUE` calculates term frequency–inverse document frequency (tf-idf) values instead of absolute frequency.
`as.feature`	Logical. If `TRUE`, the input object is returned with the matrix added as a feature; if `FALSE`, only the sparse matrix is returned. Use `corpusDocTermMatrix` to get the matrix from such an aggregated object.

Details

The settings of `terms`, `case.sens`, and `tfidf` are stored in the object's meta slot, so you can use `corpusMeta(..., "doc_term_matrix")` to fetch them.

See the examples to learn how to limit the analysis to desired word classes.
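
As a rough illustration of what the `case.sens` and `tfidf` arguments control, the following sketch builds a small sparse document-term matrix with the Matrix package alone. The helper `toy_dtm()` and the toy documents are hypothetical and not part of tm.plugin.koRpus. The weighting shown (term count divided by document length, multiplied by log(N / document frequency)) is one common tf-idf variant and is only an assumption here; the package may compute its values differently.

```r
library(Matrix)

# toy corpus: one character vector of tokens per document (made-up data)
docs <- list(
  doc1 = c("The", "cat", "sat", "on", "the", "mat"),
  doc2 = c("the", "dog", "sat", "down"),
  doc3 = c("A", "cat", "ran")
)

# hypothetical helper, NOT the implementation used by docTermMatrix()
toy_dtm <- function(docs, case.sens = FALSE, tfidf = FALSE) {
  if (!case.sens) {
    # fold case so that "The" and "the" are counted as the same term
    docs <- lapply(docs, tolower)
  }
  tokens <- unlist(docs, use.names = FALSE)
  vocab <- sort(unique(tokens))
  # one (row, column) pair per token occurrence; sparseMatrix() adds up
  # the values of duplicated pairs, which yields the absolute frequencies
  dtm <- sparseMatrix(
    i = rep(seq_along(docs), lengths(docs)),
    j = match(tokens, vocab),
    x = rep(1, length(tokens)),
    dims = c(length(docs), length(vocab)),
    dimnames = list(names(docs), vocab)
  )
  if (tfidf) {
    doc_len <- unname(rowSums(dtm))   # number of tokens per document
    doc_freq <- diff(dtm@p)           # number of documents containing each term
    tf <- dtm@x / doc_len[dtm@i + 1L] # relative frequency of each stored entry
    idf <- log(nrow(dtm) / doc_freq)  # assumed idf variant: log(N / df)
    dtm@x <- tf * rep(idf, doc_freq)
    dtm <- drop0(dtm)                 # drop entries whose weight became zero
  }
  dtm
}

counts <- toy_dtm(docs)
weights <- toy_dtm(docs, tfidf = TRUE)
counts["doc1", "the"]
weights["doc2", "dog"]
```

With `case.sens = TRUE`, "The" and "the" would end up in separate columns of the toy matrix. Both results are of class dgCMatrix, the same sparse format mentioned in the Value section.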

Value

Either an object of the input class (if `as.feature = TRUE`) or a sparse matrix of class dgCMatrix (if `as.feature = FALSE`).

Examples

# use readCorpus() to create an object of class kRp.corpus
# code is only run when the english language package can be loaded
if(require("koRpus.lang.en", quietly = TRUE)){
  myCorpus <- readCorpus(
    dir=file.path(path.package("tm.plugin.koRpus"), "examples", "corpus"),
    hierarchy=list(
      Topic=c(
        Winner="Reality Winner",
        Edwards="Natalie Edwards"
      ),
      Source=c(
        Wikipedia_prev="Wikipedia (old)",
        Wikipedia_new="Wikipedia (new)"
      )
    ),
    # use tokenize() so examples run without a TreeTagger installation
    tagger="tokenize",
    lang="en"
  )

  # get the document-term frequencies in a sparse matrix
  myDTMatrix <- docTermMatrix(myCorpus, as.feature=FALSE)

  # combine with filterByClass() to, e.g., exclude all punctuation
  myDTMatrix <- docTermMatrix(filterByClass(myCorpus), as.feature=FALSE)

  # instead of absolute frequencies, get the tf-idf values
  myDTMatrix <- docTermMatrix(
    filterByClass(myCorpus),
    tfidf=TRUE,
    as.feature=FALSE
  )
} else {}

[Package tm.plugin.koRpus version 0.4-2 Index]