polmineR-package {polmineR}    R Documentation

polmineR-package

Description

A library for corpus analysis using the Corpus Workbench (CWB) as an efficient back end for indexing and querying large corpora.

Usage

polmineR()

Details

The package offers functionality to flexibly create partitions and to carry out basic statistical operations (counts, co-occurrences, etc.). The original full text of documents can be reconstructed and inspected at any time. Beyond that, the package is intended to serve as an interface to packages implementing advanced statistical procedures. The data structures these procedures require (document-term matrices, term co-occurrence matrices, etc.) can be created from the indexed corpora.

A session registry directory (see registry()) combines the registry files for corpora that may reside anywhere on the system. Upon loading 'polmineR', the files in the registry directory defined by the environment variable CORPUS_REGISTRY are copied to the session registry directory. To see whether the environment variable CORPUS_REGISTRY is set, use the function Sys.getenv(). Corpora wrapped in R data packages can be activated using the function use().
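The check described above can be sketched as follows. This is a minimal illustration, not part of the package: the variable names registry_dir and registry_files are chosen here for the example, and the final call to registry() runs only if 'polmineR' is installed.

```r
# Sketch: inspect the CORPUS_REGISTRY environment variable with base R.
# 'registry_dir' and 'registry_files' are illustrative names.
registry_dir <- Sys.getenv("CORPUS_REGISTRY", unset = NA)

if (is.na(registry_dir) || !nzchar(registry_dir)) {
  message("CORPUS_REGISTRY is not set")
  registry_files <- character(0)
} else {
  message("CORPUS_REGISTRY: ", registry_dir)
  # Each file in a CWB registry directory describes one indexed corpus
  registry_files <- list.files(registry_dir)
}

# Optional: show the session registry directory if polmineR is available
if (requireNamespace("polmineR", quietly = TRUE)) {
  print(polmineR::registry())
}
```

If the variable is unset or empty, registry_files is an empty character vector. In that case corpora can still be made available from data packages with use().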

The package includes a draft shiny app that can be called using polmineR().

Package options

Author(s)

Andreas Blaette (andreas.blaette@uni-due.de)

References

Jockers, Matthew L. (2014): Text Analysis with R for Students of Literature. Cham et al.: Springer.

Baker, Paul (2006): Using Corpora in Discourse Analysis. London: Continuum.

Examples

# The REUTERS corpus included in the RcppCWB package is used in examples
use(pkg = "RcppCWB", corpus = "REUTERS") # activate REUTERS corpus
r <- corpus("REUTERS")
if (interactive()) show_info(r)

# The package includes GERMAPARLMINI as sample data
use("polmineR") # activate GERMAPARLMINI
gparl <- corpus("GERMAPARLMINI")
if (interactive()) show_info(gparl)

# Core methods

count("REUTERS", query = "oil")
count("REUTERS", query = c("oil", "barrel"))
count("REUTERS", query = '"Saudi" "Arab.*"', breakdown = TRUE, cqp = TRUE)
dispersion("REUTERS", query = "oil", s_attribute = "id")
k <- kwic("REUTERS", query = "oil")
coocs <- cooccurrences("REUTERS", query = "oil")


# Core methods applied to partition

kuwait <- partition("REUTERS", places = "kuwait", regex = TRUE)
C <- count(kuwait, query = "oil")
D <- dispersion(kuwait, query = "oil", s_attribute = "id")
K <- kwic(kuwait, query = "oil", meta = "id")
CO <- cooccurrences(kuwait, query = "oil")


# Go back to full text

p <- partition("REUTERS", id = 127)
if (interactive()) read(p)
h <- html(p)
h_highlighted <- highlight(h, highlight = list(yellow = "oil"))
if (interactive()) h_highlighted


# Generate term document matrix (not run by default to save time)

pb <- partition_bundle("REUTERS", s_attribute = "id")
cnt <- count(pb, p_attribute = "word")
tdm <- as.TermDocumentMatrix(cnt, col = "count")


[Package polmineR version 0.8.9 Index]