
merge.chunkrange {crfsuite}

R Documentation

CRF Training data construction: add chunk entity category to a tokenised dataset

Description

Chunks annotated with the shiny app in this R package indicate, for a chunk of text in a document, the entity that it belongs to. As a text chunk can contain several words, the chunk category has to be added to each word of a tokenised dataset. That is what this function does.
If you have a tokenised data.frame with one row per token/document, which indicates the start and end position where the token is found in the text of the document, this function assigns the chunk label to each token of the document.
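
The idea can be illustrated with a minimal base R sketch. It is not the package's implementation: the helper `label_tokens` and the toy data are hypothetical, and the sketch assumes that a token belongs to a chunk when its `start` and `end` positions both fall within the chunk's range.

```r
## Toy data (hypothetical): one document with one annotated chunk, "Acme Corp"
text   <- "I work at Acme Corp in Ghent"
chunks <- data.frame(doc_id = "d1", chunk_id = 1L, chunk_entity = "ORGANISATION",
                     start = 11L, end = 19L, stringsAsFactors = FALSE)

## Tokenise on spaces and record the character positions of each token
m      <- gregexpr("[^ ]+", text)
start  <- as.integer(m[[1]])
tokens <- data.frame(doc_id = "d1", token = regmatches(text, m)[[1]],
                     start = start,
                     end = start + attr(m[[1]], "match.length") - 1L,
                     stringsAsFactors = FALSE)

## Hypothetical helper: label every token that lies inside a chunk range,
## all other tokens keep the default entity
label_tokens <- function(tokens, chunks, default_entity = "O") {
  tokens$chunk_entity <- default_entity
  tokens$chunk_id     <- NA_integer_
  for (i in seq_len(nrow(chunks))) {
    inside <- tokens$doc_id == chunks$doc_id[i] &
      tokens$start >= chunks$start[i] &
      tokens$end <= chunks$end[i]
    tokens$chunk_entity[inside] <- chunks$chunk_entity[i]
    tokens$chunk_id[inside]     <- chunks$chunk_id[i]
  }
  tokens
}

labelled <- label_tokens(tokens, chunks)
labelled[, c("token", "start", "end", "chunk_entity")]
```

In this toy case only the tokens "Acme" and "Corp" receive the label ORGANISATION, all other tokens keep the default "O".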

Usage

## S3 method for class 'chunkrange'
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)

Arguments

`x`	an object of class `chunkrange`. A `chunkrange` is a data.frame which contains one row per chunk/doc_id. It should have the columns `doc_id`, `text`, `chunk_id`, `chunk_entity`, `start` and `end`. The fields `start` and `end` indicate where in the original `text` the chunk of words starts and where it ends. The `chunk_entity` is the label you have assigned to the chunk (e.g. ORGANISATION / LOCATION / MONEY / LABELXYZ / ...).
`y`	a tokenised data.frame containing one row per doc_id/token. It should have the columns `doc_id`, `start` and `end`, where the fields `start` and `end` indicate the positions in the original text of the `doc_id` where the token starts and where it ends. See the examples.
`by.x`	a character string with the name of the column of `x` that identifies the sequence. Defaults to 'doc_id'.
`by.y`	a character string with the name of the column of `y` that identifies the sequence. Defaults to 'doc_id'.
`default_entity`	character string with the default `chunk_entity` to be assigned to the token if the token is not part of any chunk range. Defaults to 'O'.
`...`	not used
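
Before merging, it can help to verify that both inputs carry the columns listed above. The following is only a sketch: `missing_columns` is a hypothetical helper, not a function of crfsuite, and the toy data frames are made up for illustration.

```r
## Hypothetical helper (not part of crfsuite): names of required columns
## that are absent from a data.frame
missing_columns <- function(data, required) {
  setdiff(required, colnames(data))
}

## Toy inputs: `chunks` has all documented columns, `tokens` lacks `end`
chunks <- data.frame(doc_id = "d1", text = "I work at Acme Corp in Ghent",
                     chunk_id = 1L, chunk_entity = "ORGANISATION",
                     start = 11L, end = 19L, stringsAsFactors = FALSE)
tokens <- data.frame(doc_id = "d1", token = c("Acme", "Corp"),
                     start = c(11L, 16L), stringsAsFactors = FALSE)

missing_x <- missing_columns(chunks, c("doc_id", "text", "chunk_id",
                                       "chunk_entity", "start", "end"))
missing_y <- missing_columns(tokens, c("doc_id", "start", "end"))
missing_x  # character(0): nothing missing
missing_y  # "end"
```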

Value

The data.frame `y` with 2 added columns, namely:

chunk_entity: The entity of the chunk which contains the token, as defined in `x`. If the token is not part of any chunk, this is set to `default_entity`.
chunk_id: The identifier of the chunk which contains the token.
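
As a rough illustration of how the two added columns can be used together, the sketch below rebuilds the text of each chunk from its tokens. The data frame `merged` is a hand-made stand-in for the function's output, and it assumes that tokens outside any chunk have a missing `chunk_id`, which this page does not state.

```r
## Hand-made stand-in for the merged output (hypothetical values);
## assumption: tokens outside any chunk have chunk_id NA
merged <- data.frame(
  doc_id       = "d1",
  token        = c("I", "work", "at", "Acme", "Corp", "in", "Ghent"),
  chunk_entity = c("O", "O", "O", "ORGANISATION", "ORGANISATION", "O", "O"),
  chunk_id     = c(NA, NA, NA, 1L, 1L, NA, NA),
  stringsAsFactors = FALSE
)

## Keep only the tokens that are part of a chunk
in_chunk <- merged[!is.na(merged$chunk_id), ]

## Paste the tokens of each chunk back together, one string per chunk_id
chunk_text <- tapply(in_chunk$token, in_chunk$chunk_id, paste, collapse = " ")
chunk_text
```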

Examples



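library(udpipe)
## Download the Dutch udpipe model used for tokenisation
udmodel <- udpipe_download_model("dutch-lassysmall")
if(packageVersion("udpipe") >= "0.7"){
  ## Annotated chunks: keep the first 20 and extract the unique documents
  data(airbnb_chunks, package = "crfsuite")
  airbnb_chunks <- head(airbnb_chunks, 20)
  airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])

  ## Tokenise the documents: one row per token
  airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)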
  head(airbnb_tokens)
  head(airbnb_chunks)

  ## Add the entity of the chunk to the tokenised dataset
  x <- merge(airbnb_chunks, airbnb_tokens)
  x[, c("doc_id", "token", "chunk_entity")]
  table(x$chunk_entity)
}

## cleanup for CRAN
file.remove(udmodel$file_model)

[Package crfsuite version 0.4.2 Index]