R: Compute term frequencies from a vector of text

compute_term_frequency {cranly}

R Documentation

Compute term frequencies from a vector of text

Description

Compute term frequencies from a vector of text

Usage

compute_term_frequency(
  txt,
  ignore_words = c("www.jstor.org", "www.arxiv.org", "arxiv.org", "provides", "https"),
  stem = FALSE,
  remove_punctuation = TRUE,
  remove_stopwords = TRUE,
  remove_numbers = TRUE,
  to_lower = TRUE,
  frequency = "term"
)

Arguments

`txt`	a vector of character strings.
`ignore_words`	a vector of words to be ignored when forming the corpus.
`stem`	should words be stemmed using Porter's stemming algorithm? Default is `FALSE`. See `tm::stemDocument()`.
`remove_punctuation`	should punctuation be removed when forming the corpus? Default is `TRUE`. See `tm::removePunctuation()`.
`remove_stopwords`	should English stopwords be removed when forming the corpus? Default is `TRUE`. See `tm::removeWords()` and `tm::stopwords()`.
`remove_numbers`	should numbers be removed when forming the corpus? Default is `TRUE`. See `tm::removeNumbers()`.
`to_lower`	should all terms be coerced to lower-case when forming the corpus? Default is `TRUE`.
`frequency`	the type of term frequencies to return. Options are `"term"` (default; a named vector of term frequencies), `"document-term"` (a document-term frequency matrix; see `tm::DocumentTermMatrix()`), and `"term-document"` (a term-document frequency matrix; see `tm::TermDocumentMatrix()`). The operations take place in the following order: remove special characters, convert to lower-case (depending on the value of `to_lower`), remove numbers (depending on the value of `remove_numbers`), remove stop words (depending on the value of `remove_stopwords`), remove custom words (depending on the value of `ignore_words`), remove punctuation (depending on the value of `remove_punctuation`), clean up any leading or trailing whitespace, and, finally, stem words (depending on the value of `stem`).
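
The order of operations listed for `frequency` can be illustrated with a minimal base-R sketch. This is not the package's implementation (which builds on `tm`): the function `sketch_term_frequency()`, its `stop_words` argument and the short stop-word list are hypothetical, the regular expressions are simplifying assumptions, and stemming is omitted because base R has no Porter stemmer.

```r
# Illustrative sketch only: mimics the documented order of cleaning steps.
# `sketch_term_frequency` and `stop_words` are hypothetical names.
sketch_term_frequency <- function(txt,
                                  ignore_words = character(0),
                                  stop_words = c("a", "an", "and", "for", "in", "of", "the", "to"),
                                  to_lower = TRUE,
                                  remove_numbers = TRUE,
                                  remove_stopwords = TRUE,
                                  remove_punctuation = TRUE) {
  # Helper: delete whole words; `words` are used as regular expressions here
  drop_words <- function(x, words) {
    if (length(words) == 0) return(x)
    pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
    gsub(pattern, "", x, perl = TRUE)
  }

  # 1. Replace special characters (assumed: anything that is not a letter,
  #    digit, punctuation mark or whitespace) with a space
  out <- gsub("[^[:alnum:][:punct:][:space:]]", " ", txt)
  # 2. Lower-case
  if (to_lower) out <- tolower(out)
  # 3. Numbers
  if (remove_numbers) out <- gsub("[[:digit:]]+", "", out)
  # 4. Stop words
  if (remove_stopwords) out <- drop_words(out, stop_words)
  # 5. Custom words
  out <- drop_words(out, ignore_words)
  # 6. Punctuation
  if (remove_punctuation) out <- gsub("[[:punct:]]+", "", out)
  # 7. Leading and trailing whitespace
  out <- trimws(out)

  # Count the remaining terms across all documents
  tokens <- unlist(strsplit(out, "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  counts <- table(tokens)
  sort(setNames(as.numeric(counts), names(counts)), decreasing = TRUE)
}

txt <- c(a = "The 2 models of networks",
         b = "Networks, and models for the web: see www.jstor.org")
freq <- sketch_term_frequency(txt, ignore_words = "www.jstor.org")
print(freq)
```

Removing custom words before punctuation matters in this sketch: an entry such as `"www.jstor.org"` can only be matched while its dots are still present.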

Details

If `txt` is a named vector, the names are used as document IDs when forming the corpus.

Value

Either a named numeric vector (`frequency = "term"`), an object of class `tm::DocumentTermMatrix` (`frequency = "document-term"`), or an object of class `tm::TermDocumentMatrix` (`frequency = "term-document"`).
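
Examples

The following is a small usage sketch rather than an example shipped with the package. The input vector and its names (`pkg_a`, `pkg_b`) are made up for illustration, and the code only runs if `cranly` is installed. It requests two of the documented return types for the same named input.

```r
# Hypothetical input: a named character vector, so the names act as document IDs
txt <- c(pkg_a = "Tools for network analysis",
         pkg_b = "Analysis of text networks")

if (requireNamespace("cranly", quietly = TRUE)) {
  # Named numeric vector of term frequencies (the default)
  tf <- cranly::compute_term_frequency(txt, frequency = "term")
  print(sort(tf, decreasing = TRUE))

  # Document-term frequency matrix for the same input
  dtm <- cranly::compute_term_frequency(txt, frequency = "document-term")
  print(dim(dtm))
}
```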
