merge.chunkrange {crfsuite}    R Documentation

CRF Training data construction: add chunk entity category to a tokenised dataset

Description

Chunks annotated with the shiny app in this R package indicate, for a span of text in a document, the entity it belongs to. As a text chunk can contain several words, we need a way to add this chunk category to each word of a tokenised dataset. That is what this function does.
If you have a tokenised data.frame with one row per token/document which indicates the start and end position where the token is found in the text of the document, this function assigns the chunk label to each token of the document.
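
The idea of matching tokens to chunks by position can be sketched in base R. This is a minimal illustration and not the implementation used by crfsuite: the helper label_tokens and the toy data are hypothetical, and it assumes that start and end are 1-based, inclusive character positions and that a token belongs to a chunk when it lies fully inside the chunk range.

```r
# Hypothetical helper, not part of crfsuite: give each token the entity of
# the chunk whose character range fully contains it (assumed matching rule).
label_tokens <- function(chunks, tokens, default_entity = "O") {
  tokens$chunk_entity <- default_entity
  for (i in seq_len(nrow(chunks))) {
    inside <- tokens$doc_id == chunks$doc_id[i] &
      tokens$start >= chunks$start[i] &
      tokens$end <= chunks$end[i]
    tokens$chunk_entity[inside] <- chunks$chunk_entity[i]
  }
  tokens
}

# Toy document d1: "I work at Acme Corp in Ghent"
chunks <- data.frame(
  doc_id = "d1",
  chunk_entity = c("ORGANISATION", "LOCATION"),
  start = c(11, 24),
  end = c(19, 28),
  stringsAsFactors = FALSE
)
tokens <- data.frame(
  doc_id = "d1",
  token = c("I", "work", "at", "Acme", "Corp", "in", "Ghent"),
  start = c(1, 3, 8, 11, 16, 21, 24),
  end = c(1, 6, 9, 14, 19, 22, 28),
  stringsAsFactors = FALSE
)

labelled <- label_tokens(chunks, tokens)
labelled[, c("token", "chunk_entity")]
```

Tokens outside every chunk range keep the default label, which mirrors the role of the default_entity argument described below.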

Usage

## S3 method for class 'chunkrange'
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)

Arguments

x

an object of class chunkrange. A chunkrange is just a data.frame which contains one row per chunk/doc_id. It should have the columns doc_id, text, chunk_id, chunk_entity, start and end.
The fields start and end indicate where in the original text the chunk of words starts and where it ends. The chunk_entity is a label you have assigned to the chunk (e.g. ORGANISATION / LOCATION / MONEY / LABELXYZ / ...).
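
To make the expected layout concrete, the following sketch builds a small data.frame with these six columns by hand. The text, the labels and the use of regexpr to find positions are illustrative assumptions, as is the convention that start and end are 1-based and inclusive. The sketch does not set the chunkrange class; in practice the object would normally come from the annotation app.

```r
text <- "I work at Acme Corp in Ghent"
# Hypothetical annotations: phrase -> entity label
labels <- c("Acme Corp" = "ORGANISATION", "Ghent" = "LOCATION")

# regexpr() returns the 1-based position of the first match of each phrase
starts <- vapply(
  names(labels),
  function(p) as.integer(regexpr(p, text, fixed = TRUE)),
  integer(1)
)

chunks <- data.frame(
  doc_id = "d1",
  text = text,
  chunk_id = seq_along(labels),
  chunk_entity = unname(labels),
  start = unname(starts),
  end = unname(starts) + nchar(names(labels)) - 1L,
  stringsAsFactors = FALSE
)

# Sanity check: the positions should reproduce the annotated phrases
substring(chunks$text, chunks$start, chunks$end)
```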

y

a tokenised data.frame containing one row per doc_id/token. It should have the columns doc_id, start and end, where the fields start and end indicate the positions in the original text of the doc_id where the token starts and where it ends. See the examples.
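
If no tokeniser output is at hand, a data.frame of this shape can be approximated in base R. The sketch below splits on whitespace only, which is a simplifying assumption; the example at the end of this page uses udpipe instead. It also assumes 1-based, inclusive positions.

```r
text <- "I work at Acme Corp in Ghent"

# Find every run of non-whitespace characters; gregexpr() reports the
# 1-based start of each match and its length
matches <- gregexpr("[^[:space:]]+", text)
m <- matches[[1]]

tokens <- data.frame(
  doc_id = "d1",
  token = regmatches(text, matches)[[1]],
  start = as.integer(m),
  end = as.integer(m) + attr(m, "match.length") - 1L,
  stringsAsFactors = FALSE
)
tokens
```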

by.x

a character string with the name of the column of x which identifies the sequence. Defaults to 'doc_id'.

by.y

a character string with the name of the column of y which identifies the sequence. Defaults to 'doc_id'.

default_entity

character string with the default chunk_entity to be assigned to the token if the token is not part of any chunk range. Defaults to 'O'.

...

not used

Value

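the data.frame y with 2 columns added, namely:

chunk_entity: the chunk entity of the token if the token lies inside a chunk defined in x. If the token is not part of any chunk, it is set to the value of default_entity.
chunk_id: the identifier of the chunk which the token is part of.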

Examples



library(udpipe)
udmodel <- udpipe_download_model("dutch-lassysmall")
if(packageVersion("udpipe") >= "0.7"){
  data(airbnb_chunks, package = "crfsuite")
  airbnb_chunks <- head(airbnb_chunks, 20)
  airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])

  airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
  head(airbnb_tokens)
  head(airbnb_chunks)

  ## Add the entity of the chunk to the tokenised dataset
  x <- merge(airbnb_chunks, airbnb_tokens)
  x[, c("doc_id", "token", "chunk_entity")]
  table(x$chunk_entity)
}

## cleanup for CRAN
file.remove(udmodel$file_model)



[Package crfsuite version 0.4.2 Index]