phrases-class {polmineR} | R Documentation |
Manage and use phrases
Description
Class, methods and functionality for processing phrases (lexical
units, lexical items, multi-word expressions) beyond the token level. The
envisaged workflow at this stage is to detect phrases using the ngrams
method and to generate a phrases class object from the ngrams object
using the as.phrases method. This object can be passed into a call of
count (see examples). Further methods and functions documented here are
used internally, but may be useful.
Usage
## S4 method for signature 'ngrams'
as.phrases(.Object)
## S4 method for signature 'matrix'
as.phrases(.Object, corpus, enc = encoding(corpus))
## S4 method for signature 'phrases'
as.character(x, p_attribute)
concatenate_phrases(dt, phrases, col)
Arguments
.Object | Input object, either a
corpus | A length-one
enc | Encoding of the corpus.
x | A
p_attribute | The positional attribute (p-attribute) to decode.
dt | A
phrases | A
col | If
Details
The phrases class considers a phrase as a sequence of tokens that can
be defined by a region, i.e. a left and a right corpus position. This
information is kept in a region matrix in the slot "cpos" of the
phrases class. The phrases class inherits from the regions class (which
in turn inherits from the corpus class), without adding further slots.
If .Object is an object of class ngrams, the as.phrases() method will
interpret the ngrams as CQP queries, look up the matching corpus
positions and return a phrases object.
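To illustrate what interpreting n-grams as CQP queries can look like, here is a minimal sketch in base R. The helper ngram_to_cqp() and the toy bigrams are hypothetical and not part of polmineR. The query format follows the explicitly stated CQP queries in the examples. How as.phrases() builds its queries internally is not documented here, so this is an assumption.

```r
# Hypothetical helper (not part of polmineR): turn the tokens of one n-gram
# into a CQP query string of the form '"w1" "w2";'.
ngram_to_cqp <- function(tokens) {
  # Wrap every token in double quotes, join with spaces, terminate with ';'
  paste0(paste(sprintf('"%s"', tokens), collapse = " "), ";")
}

# Toy bigrams, one row per n-gram (invented data for illustration)
bigrams <- matrix(
  c("Saudi", "Arabia",
    "oil",   "markets"),
  ncol = 2L, byrow = TRUE
)

# Build one query per row
queries <- apply(bigrams, 1L, ngram_to_cqp)
print(queries)
```

Note that CQP treats quoted tokens as regular expressions, so tokens containing metacharacters would need escaping. The sketch omits this.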
If .Object is a matrix, the as.phrases() method will initialize a
phrases object. The corpus and the encoding of the corpus will be
assigned to the object.
Applying the as.character method on a phrases object will return the
decoded regions, concatenated using an underscore as separator.
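The relation between a region matrix and the decoded, underscore-joined phrases can be sketched without a corpus. The token vector and the helper decode_regions() below are hypothetical stand-ins. The real method decodes the p-attribute from the indexed corpus. The sketch assumes zero-based corpus positions, as used by the Corpus Workbench.

```r
# Toy token stream; corpus positions are zero-based, so "Saudi" is at 2
tokens <- c("The", "kingdom", "Saudi", "Arabia", "cut", "oil", "revenue",
            "forecasts")

# Region matrix: one row per phrase, columns = left and right corpus position
cpos <- matrix(c(2L, 3L,
                 5L, 6L), ncol = 2L, byrow = TRUE)

# Hypothetical helper: decode each region and join its tokens with "_"
decode_regions <- function(cpos, tokens) {
  apply(cpos, 1L, function(region) {
    idx <- seq.int(region[1], region[2]) + 1L  # shift to one-based indexing
    paste(tokens[idx], collapse = "_")
  })
}

phrases_chr <- decode_regions(cpos, tokens)
print(phrases_chr)
```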
The concatenate_phrases function takes a data.table (argument dt) as
input and concatenates the tokens of a phrase found in successive rows
into a single phrase.
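The idea of collapsing successive rows into one phrase can be sketched in base R. The sketch uses a plain data.frame instead of a data.table so that it has no dependencies. The helper collapse_regions(), the column name cpos and the toy data are assumptions made for illustration and do not reproduce polmineR's implementation.

```r
# Toy token table: one row per token, with zero-based corpus positions
dt <- data.frame(
  cpos = 0:5,
  word = c("die", "soziale", "Marktwirtschaft", "und", "der", "Bundestag"),
  stringsAsFactors = FALSE
)

# Region matrix with one phrase spanning corpus positions 1 to 2
regions <- matrix(c(1L, 2L), ncol = 2L)

# Hypothetical helper (not polmineR's implementation): replace the rows of
# each region by a single row holding the underscore-joined phrase
collapse_regions <- function(dt, regions, col) {
  # Assign every row a group id; rows inside the same region share one id
  group <- dt$cpos
  for (i in seq_len(nrow(regions))) {
    inside <- dt$cpos >= regions[i, 1] & dt$cpos <= regions[i, 2]
    group[inside] <- regions[i, 1]
  }
  # Join the tokens of each group; single-token groups stay unchanged
  joined <- tapply(dt[[col]], group, paste, collapse = "_")
  out <- data.frame(cpos = as.integer(names(joined)), stringsAsFactors = FALSE)
  out[[col]] <- as.character(joined)
  out
}

result <- collapse_regions(dt, regions, col = "word")
print(result)
```

Each merged row keeps the left corpus position of its region, so the output remains ordered by position.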
See Also
Other classes to manage corpora: corpus-class, ranges-class, regions,
subcorpus
Examples
## Not run:
# Workflow to create document-term-matrix with phrases
# Token counts, needed as reference for the PMI computation
obs <- corpus("GERMAPARLMINI") %>%
  count(p_attribute = "word")

# Detect bigrams, score them and keep the top 100 as phrases
phrases <- corpus("GERMAPARLMINI") %>%
  ngrams(n = 2L, p_attribute = "word") %>%
  pmi(observed = obs) %>%
  subset(ngram_count > 5L) %>%
  subset(1:100) %>%
  as.phrases()

# Count tokens and phrases per speech and build the matrix
dtm <- corpus("GERMAPARLMINI") %>%
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date", progress = TRUE) %>%
  count(phrases = phrases, p_attribute = "word", progress = TRUE, verbose = TRUE) %>%
  as.DocumentTermMatrix(col = "count", verbose = FALSE)

grep("erneuerbaren_Energien", colnames(dtm))
grep("verpasste_Chancen", colnames(dtm))
## End(Not run)
## Not run:
use(pkg = "RcppCWB", corpus = "REUTERS")

# Derive phrases object from an ngrams object
reuters_phrases <- ngrams("REUTERS", p_attribute = "word", n = 2L) %>%
  pmi(observed = count("REUTERS", p_attribute = "word")) %>%
  subset(ngram_count >= 5L) %>%
  subset(1:25) %>%
  as.phrases()

# Decode the phrases as underscore-joined character strings
phr <- as.character(reuters_phrases, p_attribute = "word")
## End(Not run)
# Derive phrases from explicitly stated CQP queries
## Not run:
cqp_phrase_queries <- c(
  '"oil" "revenue";',
  '"Sheikh" "Aziz";',
  '"Abdul" "Aziz";',
  '"Saudi" "Arabia";',
  '"oil" "markets";'
)
reuters_phrases <- cpos("REUTERS", cqp_phrase_queries, p_attribute = "word") %>%
  as.phrases(corpus = "REUTERS", enc = "latin1")
## End(Not run)
# Use the concatenate_phrases() function on a data.table
## Not run:
lexical_units_cqp <- c(
  '"Deutsche.*" "Bundestag.*";',
  '"sozial.*" "Gerechtigkeit";',
  '"Ausschuss" "f.r" "Arbeit" "und" "Soziales";',
  '"soziale.*" "Marktwirtschaft";',
  '"freiheitliche.*" "Grundordnung";'
)

# enc is left at its default, encoding(corpus)
phr <- cpos("GERMAPARLMINI", query = lexical_units_cqp, cqp = TRUE) %>%
  as.phrases(corpus = "GERMAPARLMINI")

dt <- corpus("GERMAPARLMINI") %>%
  decode(p_attribute = "word", s_attribute = character(), to = "data.table") %>%
  concatenate_phrases(phrases = phr, col = "word")

dt[word == "Deutschen_Bundestag"]
dt[word == "soziale_Marktwirtschaft"]
## End(Not run)