merge.chunkrange {crfsuite} | R Documentation |
CRF Training data construction: add chunk entity category to a tokenised dataset
Description
Chunks annotated with the shiny app in this R package indicate, for a chunk of text in a document,
the entity that it belongs to. As a text chunk can contain several words, we need a way
to add this chunk category to each word of a tokenised dataset. That is what this function does.
If you have a tokenised data.frame with one row per token/document which indicates the start and end position
where the token is found in the text of the document, this function assigns the chunk label to each token
of the document.
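The idea described above can be sketched in base R. This is only an illustration of matching token positions against chunk ranges, not the implementation used by the package. The toy data, the helper name label_tokens and the column names (doc_id, token, start, end, chunk_id, chunk_entity) are assumptions made for this example.

```r
# Hypothetical toy data for the text "Nice flat near Brussels station".
# Column names are assumptions made for this illustration.
tokens <- data.frame(
  doc_id = "d1",
  token  = c("Nice", "flat", "near", "Brussels", "station"),
  start  = c(1L, 6L, 11L, 16L, 25L),
  end    = c(4L, 9L, 14L, 23L, 31L),
  stringsAsFactors = FALSE
)

# One annotated chunk covering "Brussels station" (characters 16 to 31)
chunks <- data.frame(
  doc_id       = "d1",
  chunk_id     = 1L,
  chunk_entity = "LOCATION",
  start        = 16L,
  end          = 31L,
  stringsAsFactors = FALSE
)

# Hypothetical helper: give each token the entity of the chunk that
# fully contains it, and the default entity otherwise.
label_tokens <- function(tokens, chunks, default_entity = "O") {
  tokens$chunk_entity <- default_entity
  tokens$chunk_id     <- NA_integer_
  for (i in seq_len(nrow(chunks))) {
    # A token is inside a chunk if it is in the same document and
    # its start/end positions fall within the chunk range
    inside <- tokens$doc_id == chunks$doc_id[i] &
      tokens$start >= chunks$start[i] &
      tokens$end   <= chunks$end[i]
    tokens$chunk_entity[inside] <- chunks$chunk_entity[i]
    tokens$chunk_id[inside]     <- chunks$chunk_id[i]
  }
  tokens
}

labelled <- label_tokens(tokens, chunks)
labelled[, c("token", "chunk_entity", "chunk_id")]
```

In this sketch the tokens "Brussels" and "station" receive the entity LOCATION, while the other tokens keep the default entity "O". How the actual function treats tokens that only partly overlap a chunk is not covered here.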
Usage
## S3 method for class 'chunkrange'
merge(x, y, by.x = "doc_id", by.y = "doc_id", default_entity = "O", ...)
Arguments
x |
an object of class |
y |
a tokenised data.frame containing one row per doc_id/token. It should have the columns |
by.x |
a character string of a column of |
by.y |
a character string of a column of |
default_entity |
character string with the default |
... |
not used |
Value
the data.frame y where 2 columns are added, namely:
chunk_entity: The chunk entity of the token if the token is inside the chunk defined in x. If the token is not part of any chunk, the chunk category will be set to the default value.
chunk_id: The chunk identifier of the chunk for which the token is inside the chunk.
Examples
library(udpipe)
udmodel <- udpipe_download_model("dutch-lassysmall")
if(packageVersion("udpipe") >= "0.7"){
  data(airbnb_chunks, package = "crfsuite")
  airbnb_chunks <- head(airbnb_chunks, 20)
  airbnb_tokens <- unique(airbnb_chunks[, c("doc_id", "text")])
  airbnb_tokens <- udpipe(airbnb_tokens, object = udmodel)
  head(airbnb_tokens)
  head(airbnb_chunks)
  ## Add the entity of the chunk to the tokenised dataset
  x <- merge(airbnb_chunks, airbnb_tokens)
  x[, c("doc_id", "token", "chunk_entity")]
  table(x$chunk_entity)
}
## cleanup for CRAN
file.remove(udmodel$file_model)