txt_tagsequence {udpipe} | R Documentation |
Identify a contiguous sequence of tags as being 1 entity
Description
This function identifies contiguous sequences of text which have the same label or
which follow the IOB scheme.
Named Entity Recognition or Chunking frequently follows the IOB tagging scheme
where "B" means the token begins an entity, "I" means it is inside an entity,
"E" means it is the end of an entity and "O" means it is not part of an entity.
An example of such an annotation would be 'New', 'York', 'City', 'District' which can be tagged as
'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC'.
The function looks for such sequences which start with 'B-LOC' and combines all subsequent
labels of the same tagging group into 1 category. This sequence of words also gets a unique identifier such
that the terms 'New', 'York', 'City', 'District' would get the same sequence identifier.
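The idea of giving every token of one tagged sequence the same identifier can be sketched in base R. This is only an illustration under stated assumptions, not the implementation used by udpipe: the helper name iob_sequence_id is hypothetical, tokens tagged "O" get NA here, and the identifiers it returns may differ from those returned by txt_tagsequence.

```r
# Hypothetical helper (not part of udpipe): give each IOB-tagged run its own id
iob_sequence_id <- function(labels) {
  # Strip the "B-", "I-" or "E-" prefix to get the entity type ("O" stays "O")
  type <- sub("^[BIE]-", "", labels)
  inside <- labels != "O"
  # A new sequence starts at every "B-" tag, or when the entity type changes
  prev_type <- c("O", head(type, -1))
  starts <- inside & (startsWith(labels, "B-") | type != prev_type)
  id <- cumsum(starts)
  # Assumption of this sketch: tokens outside any entity get no identifier
  id[!inside] <- NA_integer_
  id
}

tokens <- c("in", "New", "York", "City", "District", "with", "Donald", "Duck")
labels <- c("O", "B-LOC", "I-LOC", "I-LOC", "E-LOC", "O", "B-PER", "I-PER")
data.frame(token = tokens, label = labels, id = iob_sequence_id(labels))
```

In this sketch 'New', 'York', 'City', 'District' share one identifier and 'Donald', 'Duck' share the next one.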
Usage
txt_tagsequence(x, entities)
Arguments
x |
a character vector of categories in the order in which they occur (e.g. B-LOC, I-LOC, I-PER, B-PER, O, O, B-PER) |
entities |
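a list of groups, where each list element is itself a list with the elements start (the label which opens a sequence, e.g. "B-LOC") and labels (the labels which can be part of the sequence, e.g. "B-LOC", "I-LOC", "E-LOC"), as shown in the examples.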
The list name of the group defines the label that will be assigned to the entity. If |
Value
a list with elements entity_id and entity, where
entity is a character vector of the same length as x containing the entities, constructed by recoding x to the names of the groups, i.e. names(entities)
entity_id is an integer vector of the same length as x containing unique identifiers identifying the compound label sequence, such that e.g. the sequence 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC' (New York City District) would get the same entity_id identifier.
See the examples.
Examples
x <- data.frame(
  token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation",
            "Donald", "Duck", "went", "skiing",
            "in", "the", "Niagara", "Falls"),
  upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN",
           "PROPN", "PROPN", "VERB", "VERB",
           "ADP", "DET", "PROPN", "PROPN"),
  label = c("O", "O", "O", "O", "B-ORG", "I-ORG",
            "B-PERSON", "I-PERSON", "O", "O",
            "O", "O", "B-LOCATION", "I-LOCATION"),
  stringsAsFactors = FALSE)
x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos)
x
##
## Define entity groups following the IOB scheme
## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City)
groups <- list(
  Location     = list(start = "B-LOC",  labels = c("B-LOC", "I-LOC", "E-LOC")),
  Organisation = list(start = "B-ORG",  labels = c("B-ORG", "I-ORG", "E-ORG")),
  Person       = list(start = "B-PER",  labels = c("B-PER", "I-PER", "E-PER")),
  Misc         = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC")))
x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups)
x
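Once every token carries a sequence identifier, the tokens of one sequence can be pasted back together into a single string. The following is a minimal base R sketch of that step. It uses a small hand-written data frame: the entity_id and entity values in it are illustrative assumptions, not actual output of txt_tagsequence.

```r
# Illustrative data only: ids and entity names are made up, not produced by udpipe
tagged <- data.frame(
  token     = c("Nakitoma", "Corporation", "went", "Niagara", "Falls"),
  entity_id = c(1L, 1L, 2L, 3L, 3L),
  entity    = c("Organisation", "Organisation", "O", "Location", "Location"),
  stringsAsFactors = FALSE)

# Paste the tokens of each sequence into one string, one row per sequence
collapsed <- aggregate(token ~ entity_id + entity, data = tagged,
                       FUN = paste, collapse = " ")
# Restore the order in which the sequences occur in the text
collapsed <- collapsed[order(collapsed$entity_id), ]
collapsed
```

Grouping on both entity_id and entity keeps the entity name next to the collapsed text, which is convenient when filtering for one type of entity afterwards.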