corpus_segment {quanteda} | R Documentation |
Segment texts on a pattern match
Description
Segment corpus text(s) or a character vector, splitting on a pattern match. This is useful for breaking the texts into smaller documents based on a regular pattern (such as a speaker identifier in a transcript) or a user-supplied annotation.
Usage
corpus_segment(
x,
pattern = "##*",
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
extract_pattern = TRUE,
pattern_position = c("before", "after"),
use_docvars = TRUE
)
char_segment(
x,
pattern = "##*",
valuetype = c("glob", "regex", "fixed"),
case_insensitive = TRUE,
remove_pattern = TRUE,
pattern_position = c("before", "after")
)
Arguments
x |
character or corpus object whose texts will be segmented |
pattern |
a character vector, list of character vectors, dictionary, or collocations object. See pattern for details. |
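valuetype |
the type of pattern matching: "glob" for "glob"-style wildcard expressions; "regex" for regular expressions; or "fixed" for exact matching. See valuetype for details. |
case_insensitive |
logical; if TRUE, ignore case when matching a pattern |
extract_pattern |
extracts matched patterns from the texts and saves them in docvars if TRUE |
pattern_position |
either "before" or "after", depending on whether the pattern precedes the text (as with a tag) or follows the text (as with punctuation delimiters) |
use_docvars |
if TRUE, repeat the docvar values for each segmented text; if FALSE, drop the docvars in the segmented corpus |
remove_pattern |
removes matched patterns from the texts if TRUE |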
Details
For segmentation into syntactic units defined by the locale (such as sentences), use corpus_reshape() instead. In cases where more fine-grained segmentation is needed, such as that based on commas or semi-colons (phrase delimiters within a sentence), corpus_segment() offers greater user control than corpus_reshape().
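The contrast between the two functions can be sketched as follows. This is a minimal illustration, not part of the package examples: the text and the object names (corp, sents, clauses) are hypothetical, and the regex "[,;]" is just one possible choice of clause delimiter.

```r
library(quanteda)

# Hypothetical toy text: two sentences, each containing clause delimiters
corp <- corpus(c(d1 = "First clause, second clause; third clause. Another sentence, with a comma."))

# Sentence-level units via corpus_reshape()
sents <- corpus_reshape(corp, to = "sentences")

# Clause-level units: split after each comma or semi-colon,
# keeping the delimiter in the text
clauses <- corpus_segment(corp, pattern = "[,;]", valuetype = "regex",
                          pattern_position = "after", extract_pattern = FALSE)

ndoc(sents)            # number of sentence-level documents
as.character(clauses)  # the clause-level segments
```

Because this pattern splits only on commas and semi-colons, it ignores sentence boundaries, so a single segment may span a full stop.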
Value
corpus_segment returns a corpus of segmented texts
char_segment returns a character vector of segmented texts
Boundaries and segmentation explained
The pattern acts as a boundary delimiter that defines the segmentation points for splitting a text into new "document" units. Boundaries are always defined as the pattern matches, plus the beginning and end of each document. The new "documents" created by the segmentation are then the texts found between boundaries.
The pattern itself will be saved as a new document variable named pattern. This is most useful when segmenting a text according to tags such as names in a transcript, section titles, or user-supplied annotations. If the beginning of the file precedes a pattern match, then the extracted text will have an NA for the extracted pattern document variable (or when pattern_position = "after", this will be true for the text split between the last pattern match and the end of the document).
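A minimal sketch of the NA case described above, using a hypothetical toy text in which a preamble precedes the first tag. The object names (corp, seg) are illustrative only.

```r
library(quanteda)

# Hypothetical toy text: a preamble precedes the first "##" tag
corp <- corpus(c(d1 = "Preamble text. ##A First part. ##B Second part."))

# Default glob pattern "##*"; tags are extracted into the "pattern" docvar
seg <- corpus_segment(corp, pattern = "##*")

# One row per segment: the segment text next to its extracted tag.
# Per the description above, the preamble segment should carry NA.
data.frame(text = as.character(seg), pattern = docvars(seg, "pattern"))
```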
To extract syntactically defined sub-document units such as sentences and paragraphs, use corpus_reshape() instead.
Using patterns
One of the most common uses for corpus_segment is to partition a corpus into sub-documents using tags. The default pattern value is designed for a user-annotated tag that is a term beginning with double "hash" signs, followed by whitespace, for instance ##INTRODUCTION The text.
Glob and fixed pattern types use a whitespace character to signal the end of the pattern.
For more advanced pattern matches that could include whitespace or newlines, a regex pattern type can be used, for instance a text such as
Mr. Smith: Text
Mrs. Jones: More text
could have as pattern = "\\b[A-Z].+\\.\\s[A-Z][a-z]+:", which would catch the title, the name, and the colon.
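One way to check what such a regex captures before using it for segmentation is to test it on a sample string. The sketch below uses only base R (gregexpr() and regmatches() with perl = TRUE) rather than quanteda itself; the variable names are hypothetical, and base R's PCRE engine may differ from the engine quanteda uses in edge cases, so treat it as a rough preview.

```r
# Sample transcript text from the description above
txt <- "Mr. Smith: Text\nMrs. Jones: More text"

# Speaker-identifier pattern: title, name, and colon
pat <- "\\b[A-Z].+\\.\\s[A-Z][a-z]+:"

# Extract every match of the pattern from the sample text
hits <- regmatches(txt, gregexpr(pat, txt, perl = TRUE))[[1]]
hits
# [1] "Mr. Smith:"  "Mrs. Jones:"
```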
For custom boundary delimitation using punctuation characters that come at the end of a clause or sentence (such as , and .), these can be specified manually and pattern_position set to "after". To keep the punctuation characters in the text (as with sentence segmentation), set extract_pattern = FALSE. (With most tag applications, users will want to remove the patterns from the text, as they are annotations rather than parts of the text itself.)
See Also
corpus_reshape(), for segmenting texts into pre-defined syntactic units such as sentences, paragraphs, or fixed-length chunks
Examples
## segmenting a corpus
# segmenting a corpus using tags
corp1 <- corpus(c("##INTRO This is the introduction.
##DOC1 This is the first document. Second sentence in Doc 1.
##DOC3 Third document starts here. End of third document.",
"##INTRO Document ##NUMBER Two starts before ##NUMBER Three."))
corpseg1 <- corpus_segment(corp1, pattern = "##*")
cbind(corpseg1, docvars(corpseg1))
# segmenting a transcript based on speaker identifiers
corp2 <- corpus("Mr. Smith: Text.\nMrs. Jones: More text.\nMr. Smith: I'm speaking, again.")
corpseg2 <- corpus_segment(corp2, pattern = "\\b[A-Z].+\\s[A-Z][a-z]+:",
valuetype = "regex")
cbind(corpseg2, docvars(corpseg2))
# segmenting a corpus using crude end-of-sentence segmentation
corpseg3 <- corpus_segment(corp1, pattern = ".", valuetype = "fixed",
pattern_position = "after", extract_pattern = FALSE)
cbind(corpseg3, docvars(corpseg3))
## segmenting a character vector
# segment into paragraphs, removing the "- " bullet points
cat(data_char_ukimmig2010[4])
char_segment(data_char_ukimmig2010[4],
pattern = "\\n\\n(-\\s){0,1}", valuetype = "regex",
remove_pattern = TRUE)
# segment a text into clauses
txt <- c(d1 = "This, is a sentence? You: come here.", d2 = "Yes, yes okay.")
char_segment(txt, pattern = "\\p{P}", valuetype = "regex",
pattern_position = "after", remove_pattern = FALSE)