R: Mine Text

mine_text {inlpubs}

R Documentation

Mine Text

Description

Performs a term frequency text analysis. A term is defined as a word or group of words.

Usage

mine_text(docs, ngmin = 1, ngmax = ngmin, sparse = NULL)

Arguments

`docs`	'list' or 'character' vector. Document text to analyze. Each list item contains the extracted text from a single document.
`ngmin`, `ngmax`	'integer' number. Splits strings into n-grams with the given minimum and maximum numbers of grams. An n-gram is an ordered sequence of n words taken from the body of a text. Requires that the RWeka package is available and that the environment variable JAVA_HOME points to where the Java software is located. Recommended for single text components only.
`sparse`	'numeric' number that is greater than 0 and less than 1. A sparsity threshold based on the relative document frequency of a term. It controls the proportion of documents in which a term must appear to be retained, and larger values retain more terms. For example, if you specify `sparse` equal to 0.99, terms that are more sparse than 0.99 are removed. Conversely, at 0.01, only terms appearing in nearly every document are retained.
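
To make the n-gram definition concrete, the following is a minimal base-R sketch of how a string can be split into n-grams of sizes `ngmin` through `ngmax`. The helper `make_ngrams` is hypothetical and is not part of inlpubs or RWeka; the lowercasing and punctuation stripping are assumptions made for illustration and may differ from what `mine_text` does internally.

```r
# Hypothetical helper (not part of inlpubs): build all n-grams of
# sizes ngmin..ngmax from a single string using base R only.
make_ngrams <- function(x, ngmin = 1L, ngmax = ngmin) {
  stopifnot(ngmin >= 1, ngmax >= ngmin)
  # Assumed preprocessing: lowercase, drop punctuation, split on whitespace.
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  out <- character(0)
  for (n in seq(ngmin, ngmax)) {
    # No n-grams exist once n exceeds the number of words.
    if (n > length(words)) break
    starts <- seq_len(length(words) - n + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + n - 1L)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Unigrams and bigrams for one of the example sentences.
ngrams <- make_ngrams("Pack my brown box.", ngmin = 1, ngmax = 2)
print(ngrams)
```

With `ngmin = 1` and `ngmax = 2`, the sketch returns the four single words followed by the three adjacent word pairs.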

Details

HTML entities are decoded when the textutils package is available.
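
As an illustration of entity decoding, the sketch below calls `HTMLdecode` from the textutils package when that package is installed and otherwise leaves the text unchanged. Whether `mine_text` uses this exact call is an assumption, and the input string is made up for the example.

```r
# Illustrative input containing an HTML entity.
txt <- "Fish &amp; Wildlife"

# Decode only when textutils is installed, mirroring the optional
# dependency described above (the exact internal call is an assumption).
decoded <- if (requireNamespace("textutils", quietly = TRUE)) {
  textutils::HTMLdecode(txt)
} else {
  txt
}
print(decoded)
```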

Value

A term-frequency data table giving the number of times each term occurs in the text. Each column in the table represents a single component in the `docs` argument, and each row provides the frequency counts for a particular term (a word or group of words).
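
The layout described above, with terms in rows and documents in columns, can be sketched in base R as follows. This is not the implementation used by `mine_text`; the tokenizer and the helper `keep_sparse` are hypothetical. The sparsity rule shown (keep a term only if the fraction of documents lacking it is less than `sparse`) is an assumption, chosen because it matches the behavior described for the `sparse` argument.

```r
# Documents taken from the Examples section of this page.
docs <- c(
  A = "The quick brown fox jumps over the lazy lazy dog.",
  B = "Pack my brown box.",
  C = "Jazz fly brown dog."
)

# Assumed tokenizer: lowercase, drop punctuation, split on whitespace.
tokenize <- function(x) {
  words <- strsplit(gsub("[[:punct:]]", "", tolower(x)), "\\s+")[[1]]
  words[nzchar(words)]
}
tokens <- lapply(docs, tokenize)

# Count each term in each document: rows are terms, columns are documents.
terms <- sort(unique(unlist(tokens, use.names = FALSE)))
tf <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = terms))),
  integer(length(terms))
)
rownames(tf) <- terms
print(tf[c("brown", "dog", "lazy"), ])

# Hypothetical helper: a term's sparsity is the fraction of documents
# in which it does not occur; keep terms with sparsity below `sparse`.
keep_sparse <- function(tf, sparse) {
  tf[rowMeans(tf == 0) < sparse, , drop = FALSE]
}
print(rownames(keep_sparse(tf, 0.5)))
```

Under this assumed rule, a threshold near 1 keeps every term, while a threshold near 0 keeps only terms found in all three documents.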

Author(s)

J.C. Fisher, U.S. Geological Survey, Idaho Water Science Center

Examples

# Documents supplied as a character vector
d <- c(
  "The quick brown fox jumps over the lazy lazy dog.",
  "Pack my brown box.",
  "Jazz fly brown dog."
) |>
  mine_text()

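# Documents supplied as a named list, one item per document;
# item "B" has several text components, and NA marks missing text
d <- list(
  "A" = "The quick brown fox jumps over the lazy lazy dog.",
  "B" = c("Pack my brown box.", NA, "Jazz fly brown dog."),
  "C" = NA_character_
) |>
  mine_text()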

[Package inlpubs version 1.1.3 Index]