crf_cbind_attributes {crfsuite}    R Documentation

Enrich a data.frame by adding frequently used CRF attributes

Description

The CRF attributes implemented in this function are the neighbouring information of a certain field: for example the previous word, the next word, or the combination of the previous 2 words. This function cbinds these neighbouring attributes as columns to the provided data.frame.

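By default it adds columns for the term itself, for the terms up to 2 positions before and after it, and for the combinations of these neighbouring terms up to trigrams (following the defaults of from, to and ngram_max).

See the examples.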
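To make the idea of neighbouring attributes concrete, the following is a minimal base R sketch of looking up the previous and next term within each sequence. It is an illustration only, not the package's implementation: the helper shift_by and the column names pos_prev and pos_next are hypothetical and are not part of crfsuite, and the columns created by crf_cbind_attributes may be named differently.

```r
# Hypothetical helper (not part of crfsuite): shift a vector by n positions.
# n < 0 looks back, n > 0 looks ahead; positions outside the sequence become NA.
shift_by <- function(x, n) {
  idx <- seq_along(x) + n
  idx[idx < 1 | idx > length(x)] <- NA
  x[idx]
}

x <- data.frame(doc_id = c(1, 1, 1, 2, 2),
                pos    = c("Art", "N", "V", "Pron", "V"),
                stringsAsFactors = FALSE)

# ave() applies the shift separately within each doc_id,
# so a neighbour is never taken from another document
x$pos_prev <- ave(x$pos, x$doc_id, FUN = function(v) shift_by(v, -1))
x$pos_next <- ave(x$pos, x$doc_id, FUN = function(v) shift_by(v, 1))
x
```

The first term of each document has no previous term and the last term has no next term, so those cells are NA in this sketch.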

Usage

crf_cbind_attributes(
  data,
  terms,
  by,
  from = -2,
  to = 2,
  ngram_max = 3,
  sep = "-"
)

Arguments

data

a data.frame which will be coerced to a data.table (cbinding will be done by reference on the existing data.frame)

terms

a character vector of column names which are part of data. For each of these columns, the function looks at the preceding and following rows and cbinds this information to the data

by

a character vector of column names which are part of data indicating the fields which define the sequence. Preceding/following terms are looked up only within each group defined by by. Typically this will be a document identifier or sentence identifier in an NLP context.

from

integer, by default set to -2, indicating to look up to 2 terms before the current term

to

integer, by default set to 2, indicating to look up to 2 terms after the current term

ngram_max

integer indicating the maximum number of terms to combine (2 means bigrams, 3 trigrams, ...)

sep

character indicating how to combine the previous/next/current terms. Defaults to '-'.
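As a rough illustration of how sep enters the picture, the sketch below joins a previous term and the current term into a bigram with paste. This is an assumption-laden sketch in base R, not the package's code: the vectors prev and curr are made-up data, and returning NA when either neighbour is missing is an assumption about the behaviour rather than something stated on this page.

```r
# Made-up data: the previous term and the current term of a 3-token sequence
prev <- c(NA, "Art", "N")
curr <- c("Art", "N", "V")
sep  <- "-"

# Combine previous and current term into a bigram.
# Assumption: the bigram is NA when one of its parts is missing.
bigram <- ifelse(is.na(prev) | is.na(curr),
                 NA_character_,
                 paste(prev, curr, sep = sep))
bigram
```

With sep = "|" instead, the same terms would be joined as "Art|N" and "N|V", which can be convenient when the terms themselves contain hyphens.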

Examples

x <- data.frame(doc_id = sort(sample.int(n = 10, size = 1000, replace = TRUE)))
x$pos <- sample(c("Art", "N", "Prep", "V", "Adv", "Adj", "Conj", 
                  "Punc", "Num", "Pron", "Int", "Misc"), 
                  size = nrow(x), replace = TRUE)
x <- crf_cbind_attributes(x, terms = "pos", by = "doc_id", 
                          from = -1, to = 1, ngram_max = 3)
head(x)


## Example on some real data
x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"), 
                          by = c("doc_id", "sentence_id"),
                          ngram_max = 3, sep = "|")


[Package crfsuite version 0.4.2 Index]