get_token_stream {polmineR} | R Documentation |
Get Token Stream.
Description
Auxiliary method to get the full text of a corpus, subcorpus, partition etc. Can be used to export corpus data to other tools.
Usage
get_token_stream(.Object, ...)
## S4 method for signature 'numeric'
get_token_stream(
.Object,
corpus,
registry = NULL,
p_attribute,
subset = NULL,
boost = NULL,
encoding = NULL,
collapse = NULL,
beautify = TRUE,
cpos = FALSE,
cutoff = NULL,
decode = TRUE,
...
)
## S4 method for signature 'matrix'
get_token_stream(.Object, corpus, registry = NULL, split = FALSE, ...)
## S4 method for signature 'corpus'
get_token_stream(.Object, left = NULL, right = NULL, ...)
## S4 method for signature 'character'
get_token_stream(.Object, left = NULL, right = NULL, ...)
## S4 method for signature 'slice'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
## S4 method for signature 'partition'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
## S4 method for signature 'subcorpus'
get_token_stream(.Object, p_attribute, collapse = NULL, cpos = FALSE, ...)
## S4 method for signature 'regions'
get_token_stream(
.Object,
p_attribute = "word",
collapse = NULL,
cpos = FALSE,
split = FALSE,
...
)
## S4 method for signature 'partition_bundle'
get_token_stream(
.Object,
p_attribute = "word",
vocab = NULL,
phrases = NULL,
subset = NULL,
min_length = NULL,
collapse = NULL,
cpos = FALSE,
decode = TRUE,
beautify = FALSE,
verbose = TRUE,
progress = FALSE,
mc = FALSE,
...
)
Arguments
.Object |
Input object. |
... |
Arguments that will be passed into the
|
corpus |
A CWB indexed corpus. |
registry |
Registry directory with registry file describing the corpus. |
p_attribute |
A |
subset |
An expression applied on p-attributes, using non-standard evaluation. Note that symbols used in the expression must not clash with symbols that are used internally (e.g. 'stopwords'). |
boost |
A length-one |
encoding |
If not |
collapse |
If not |
beautify |
A (length-one) |
cpos |
A |
cutoff |
Maximum number of tokens to be reconstructed. |
decode |
A (length-one) |
split |
A |
left |
Left corpus position. |
right |
Right corpus position. |
vocab |
A |
phrases |
A |
min_length |
If not |
verbose |
A length-one |
progress |
A length-one |
mc |
Number of cores to use. If |
Details
CWB indexed corpora have a fixed order of tokens which is called the token stream. Every token is assigned to a unique corpus position. Subsets of the (entire) token stream defined by a left and a right corpus position are called regions. The get_token_stream method extracts the tokens (for regions) from a corpus.
The primary usage of this method is to return the token stream of a (sub-)corpus as defined by a corpus, subcorpus or partition object.
The methods defined for a numeric vector or a (two-column) matrix defining regions (i.e. left and right corpus positions in the first and second column) are the actual workers for this operation.
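The relationship between corpus positions, regions and the token stream can be sketched in base R without a CWB corpus. This is only an illustration of the concept, not the package's implementation: the toy token vector and the helper decode_region() are hypothetical, and the only facts taken from above are that corpus positions are zero-based and that a region is a pair of left and right corpus positions.

```r
# Toy token stream standing in for an indexed corpus (hypothetical data).
tokens <- c("Oil", "prices", "rose", "sharply", ".",
            "Traders", "were", "surprised", ".")

# Two regions, one per row: left and right corpus position (both inclusive).
regions <- matrix(c(0L, 4L, 5L, 8L), ncol = 2, byrow = TRUE)

# Hypothetical helper: corpus positions are zero-based, R indices are
# one-based, so shift by one when looking up tokens.
decode_region <- function(left, right) tokens[(left:right) + 1L]

# Decode every region, then collapse each one into a single string.
by_region <- lapply(
  seq_len(nrow(regions)),
  function(i) decode_region(regions[i, 1], regions[i, 2])
)
collapsed <- vapply(by_region, paste, character(1), collapse = " ")
print(collapsed)
```

With a real corpus, the numeric and matrix methods take the place of the manual lookup; the sketch only shows what "decoding a region" means.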
The get_token_stream method was introduced to serve as a worker for higher-level methods such as read, html, and as.markdown. It may, however, also be useful for decoding a corpus so that it can be exported to other tools.
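A minimal sketch of such an export, assuming the decoded tokens are available as a plain character vector: the vector below is a hypothetical stand-in for the result of a get_token_stream() call, and the one-token-per-line layout is just one possible exchange format.

```r
# Stand-in for a decoded token stream (hypothetical data); with a real
# corpus this vector would come from get_token_stream().
tokens <- c("Oil", "prices", "rose", "sharply", ".")

# Write one token per line to a temporary file.
outfile <- tempfile(fileext = ".txt")
writeLines(tokens, con = outfile)

# Read the file back to check the round trip.
exported <- readLines(outfile)
print(exported)
```

If running text is needed instead, the tokens can be joined first, for example with paste(tokens, collapse = " ").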
Examples
use(pkg = "RcppCWB", corpus = "REUTERS")
# Decode first words of REUTERS corpus (first sentence)
get_token_stream(0:20, corpus = "REUTERS", p_attribute = "word")
# Decode first sentence and collapse tokens into single string
get_token_stream(0:20, corpus = "REUTERS", p_attribute = "word", collapse = " ")
# Decode regions defined by two-column integer matrix
region_matrix <- matrix(c(0L, 20L, 21L, 38L), ncol = 2, byrow = TRUE)
get_token_stream(
region_matrix,
corpus = "REUTERS",
p_attribute = "word",
encoding = "latin1"
)
# Use argument 'beautify' to remove surplus whitespace
## Not run:
get_token_stream(
region_matrix,
corpus = "GERMAPARLMINI",
p_attribute = "word",
encoding = "latin1",
collapse = " ", beautify = TRUE
)
## End(Not run)
# Decode entire corpus (corpus object / specified by corpus ID)
corpus("REUTERS") %>%
get_token_stream(p_attribute = "word") %>%
head()
# Decode subcorpus
corpus("REUTERS") %>%
subset(id == "127") %>%
get_token_stream(p_attribute = "word") %>%
head()
# Decode partition_bundle
## Not run:
pb_tokstr <- corpus("REUTERS") %>%
split(s_attribute = "id") %>%
get_token_stream(p_attribute = "word")
## End(Not run)
## Not run:
# Get token stream for partition_bundle
pb <- partition_bundle("REUTERS", s_attribute = "id")
ts_list <- get_token_stream(pb)
# Use two p-attributes
sp <- corpus("GERMAPARLMINI") %>%
as.speeches(s_attribute_name = "speaker", s_attribute_date = "date", progress = FALSE)
p2 <- get_token_stream(sp, p_attribute = c("word", "pos"), verbose = FALSE)
# Apply filter
p_sub <- get_token_stream(
sp, p_attribute = c("word", "pos"),
subset = {!grepl("(\\$.$|ART)", pos)}
)
# Concatenate phrases and apply filter
queries <- c('"freiheitliche" "Grundordnung"', '"Bundesrepublik" "Deutschland"')
phr <- corpus("GERMAPARLMINI") %>%
cpos(query = queries) %>%
as.phrases(corpus = "GERMAPARLMINI")
kill <- tm::stopwords("de")
ts_phr <- get_token_stream(
sp,
p_attribute = c("word", "pos"),
subset = {!word %in% kill & !grepl("(\\$.$|ART)", pos)},
phrases = phr,
progress = FALSE,
verbose = FALSE
)
## End(Not run)