R: Manage and use phrases

phrases-class {polmineR}

R Documentation

Manage and use phrases

Description

Class, methods and functionality for processing phrases (lexical units, lexical items, multi-word expressions) beyond the token level. The envisaged workflow at this stage is to detect phrases using the ngrams-method and to generate a phrases class object from the ngrams object using the as.phrases method. This object can be passed into a call of count, see examples. Further methods and functions documented here are used internally, but may be useful.

Usage

## S4 method for signature 'ngrams'
as.phrases(.Object)

## S4 method for signature 'matrix'
as.phrases(.Object, corpus, enc = encoding(corpus))

## S4 method for signature 'phrases'
as.character(x, p_attribute)

concatenate_phrases(dt, phrases, col)

Arguments

`.Object`	Input object, either a `ngrams` or a `matrix` object.
`corpus`	A length-one `character` vector, the corpus ID of the corpus from which regions / the `data.table` representing a decoded corpus is derived.
`enc`	Encoding of the corpus.
`x`	A `phrases` class object.
`p_attribute`	The positional attribute (p-attribute) to decode.
`dt`	A `data.table`.
`phrases`	A `phrases` class object.
`col`	If `.Object` is a `data.table`, the column to concatenate.

Details

The phrases considers a phrase as sequence as tokens that can be defined by region, i.e. a left and a right corpus position. This information is kept in a region matrix in the slot "cpos" of the phrases class. The phrases class inherits from the regions class (which inherits from the and the corpus class), without adding further slots.

If .Object is an object of class ngrams, the as.phrases()-method will interpret the ngrams as CQP queries, look up the matching corpus positions and return an phrases object.

If .Object is a matrix, the as.phrases()-method will initialize a phrases object. The corpus and the encoding of the corpus will be assigned to the object.

Applying the as.character-method on a phrases object will return the decoded regions, concatenated using an underscore as seperator.

The concatenate_phrases function takes a data.table (argument dt) as input and concatenates phrases in successive rows into a phrase.

Examples

## Not run: 
# Workflow to create document-term-matrix with phrases

obs <- corpus("GERMAPARLMINI") %>%
  count(p_attribute = "word")

phrases <- corpus("GERMAPARLMINI") %>%
  ngrams(n = 2L, p_attribute = "word") %>%
  pmi(observed = obs) %>% 
  subset(ngram_count > 5L) %>%
  subset(1:100) %>%
  as.phrases()

dtm <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date", progress = TRUE) %>%
  count(phrases = phrases, p_attribute = "word", progress = TRUE, verbose = TRUE) %>%
  as.DocumentTermMatrix(col = "count", verbose = FALSE)
  
grep("erneuerbaren_Energien", colnames(dtm))
grep("verpasste_Chancen", colnames(dtm))

## End(Not run)
## Not run: 
use(pkg = "RcppCWB", corpus = "REUTERS")

# Derive phrases object from an ngrams object

reuters_phrases <- ngrams("REUTERS", p_attribute = "word", n = 2L) %>%
  pmi(observed = count("REUTERS", p_attribute = "word")) %>%
  subset(ngram_count >= 5L) %>%
  subset(1:25) %>%
  as.phrases()

phr <- as.character(reuters_phrases, p_attribute = "word")

## End(Not run)
# Derive phrases from explicitly stated CQP queries

## Not run: 
cqp_phrase_queries <- c(
  '"oil" "revenue";',
  '"Sheikh" "Aziz";',
  '"Abdul" "Aziz";',
  '"Saudi" "Arabia";',
  '"oil" "markets";'
)
reuters_phrases <- cpos("REUTERS", cqp_phrase_queries, p_attribute = "word") %>%
  as.phrases(corpus = "REUTERS", enc = "latin1")

## End(Not run)
  
# Use the concatenate_phrases() function on a data.table

## Not run: 
#' lexical_units_cqp <- c(
  '"Deutsche.*" "Bundestag.*";',
  '"sozial.*" "Gerechtigkeit";',
  '"Ausschuss" "f.r" "Arbeit" "und" "Soziales";',
  '"soziale.*" "Marktwirtschaft";',
  '"freiheitliche.*" "Grundordnung";'
)

phr <- cpos("GERMAPARLMINI", query = lexical_units_cqp, cqp = TRUE) %>%
  as.phrases(corpus = "GERMAPARLMINI", enc = "word")

dt <- corpus("GERMAPARLMINI") %>%
  decode(p_attribute = "word", s_attribute = character(), to = "data.table") %>%
  concatenate_phrases(phrases = phr, col = "word")
  
dt[word == "Deutschen_Bundestag"]
dt[word == "soziale_Marktwirtschaft"]

## End(Not run)