txt_tagsequence {udpipe}    R Documentation

Identify a contiguous sequence of tags as being 1 entity

Description

This function identifies contiguous sequences of text which have the same label or which follow the IOB scheme.
Named Entity Recognition and Chunking frequently follow the IOB tagging scheme, where "B" means the token begins an entity, "I" means it is inside an entity, "E" means it ends an entity and "O" means it is not part of an entity. For example, the tokens 'New', 'York', 'City', 'District' can be tagged as 'B-LOC', 'I-LOC', 'I-LOC', 'E-LOC'.
The function looks for sequences which start with a label such as 'B-LOC' and combines all subsequent labels of the same tagging group into 1 category. Each such sequence of words also gets a unique identifier, so that the terms 'New', 'York', 'City', 'District' all get the same sequence identifier.
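
The grouping idea described above can be sketched in base R. This is only an illustration of the concept under simplifying assumptions (every tag starting with "B-" opens a new sequence, and "O" tokens get no identifier); it is not the implementation used by txt_tagsequence, and the variable names are made up for this sketch.

```r
## Illustration only: not the udpipe implementation
tags <- c("B-LOC", "I-LOC", "I-LOC", "E-LOC", "O", "B-PER", "I-PER")

## Assumption: a new sequence starts at every tag beginning with "B-"
is_start  <- startsWith(tags, "B-")
is_entity <- tags != "O"

## Running count of sequence starts, blanked out for tokens outside an entity
seq_id <- ifelse(is_entity, cumsum(is_start), NA_integer_)

## Strip the B-/I-/E- prefix to get one category per sequence
entity <- ifelse(is_entity, sub("^[BIE]-", "", tags), NA_character_)

data.frame(tags, seq_id, entity, stringsAsFactors = FALSE)
```

In this sketch the four location tags share one identifier and the two person tags share another, which mirrors the behaviour described for 'New', 'York', 'City', 'District'.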

Usage

txt_tagsequence(x, entities)

Arguments

x

a character vector of categories in order of occurrence (e.g. B-LOC, I-LOC, I-PER, B-PER, O, O, B-PER)

entities

a list of groups, where each list element contains

  • start: A character string of length 1 giving the label which starts a sequence, e.g. 'B-LOC'

  • labels: A character vector containing all the labels which are considered part of the same labelling sequence, including the starting label, e.g. c('B-LOC', 'I-LOC', 'E-LOC')

The list name of the group defines the label that will be assigned to the entity. If entities is not provided, each distinct value of x is considered an entity. See the examples.

Value

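a list with elements entity_id and entity where

  • entity_id: the identifier of the sequence to which each element of x belongs; elements of the same contiguous sequence share the same identifier

  • entity: the label assigned to each element of x, taken from the names of entities when that argument is provided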

See the examples.
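
Because tokens of one sequence share an entity_id, the identifier can be used to paste them back together into a single phrase. The sketch below uses base R only; the data frame ann and its values are written by hand for illustration and are not actual output of txt_tagsequence.

```r
## Hypothetical, hand-written values (not produced by txt_tagsequence)
ann <- data.frame(
  token     = c("New", "York", "City", "Donald", "Duck"),
  entity_id = c(1L, 1L, 1L, 2L, 2L),
  entity    = c("Location", "Location", "Location", "Person", "Person"),
  stringsAsFactors = FALSE)

## Collapse the tokens of each sequence into one phrase, keyed by entity_id
phrases <- tapply(ann$token, ann$entity_id, paste, collapse = " ")
phrases
```

The same pattern applies to a data frame holding real annotations, as long as it has one row per token and the two columns returned by the function.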

Examples

x <- data.frame(
  token = c("The", "chairman", "of", "the", "Nakitoma", "Corporation",
            "Donald", "Duck", "went", "skiing",
            "in", "the", "Niagara", "Falls"),
  upos = c("DET", "NOUN", "ADP", "DET", "PROPN", "PROPN", 
           "PROPN", "PROPN", "VERB", "VERB", 
           "ADP", "DET", "PROPN", "PROPN"),
  label = c("O", "O", "O", "O", "B-ORG", "I-ORG", 
            "B-PER", "I-PER", "O", "O", 
            "O", "O", "B-LOC", "I-LOC"), stringsAsFactors = FALSE)
## Without entities, each distinct value of upos is considered an entity
x[, c("sequence_id", "group")] <- txt_tagsequence(x$upos)
x

##
## Define entity groups following the IOB scheme
## and combine B-LOC I-LOC I-LOC sequences as 1 group (e.g. New York City) 
groups <- list(
  Location     = list(start = "B-LOC",  labels = c("B-LOC", "I-LOC", "E-LOC")),
  Organisation = list(start = "B-ORG",  labels = c("B-ORG", "I-ORG", "E-ORG")),
  Person       = list(start = "B-PER",  labels = c("B-PER", "I-PER", "E-PER")),
  Misc         = list(start = "B-MISC", labels = c("B-MISC", "I-MISC", "E-MISC")))
x[, c("entity_id", "entity")] <- txt_tagsequence(x$label, groups)
x

[Package udpipe version 0.8.11 Index]