crf_cbind_attributes {crfsuite} | R Documentation |
Enrich a data.frame by adding frequently used CRF attributes
Description
The CRF attributes implemented in this function are simply the neighbouring values of a given field,
for example the previous word, the next word, or the combination of the previous 2 words.
This function cbinds these neighbouring attributes as columns to the provided data.frame.
By default it adds the following columns to the data.frame:
the term itself (term[t])
the next term (term[t+1])
the term after that (term[t+2])
the previous term (term[t-1])
the term before the previous term (term[t-2])
as well as all combinations of these terms (bigrams/trigrams/...) where up to ngram_max terms are combined.
See the examples.
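The neighbour lookup described above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the helper shift_by_group and the column names pos_prev and pos_next are hypothetical, and the sketch assumes that positions without a neighbour inside the same group are filled with NA.

```r
# Hypothetical helper (not part of crfsuite): shift a vector by n positions
# within each group. n < 0 looks back, n > 0 looks ahead, n = 0 is the term itself.
shift_by_group <- function(x, group, n) {
  shift_one <- function(v) {
    len <- length(v)
    k <- abs(n)
    # Assumption: no neighbour inside the group means a missing value
    if (k >= len) return(rep(NA_character_, len))
    if (n < 0) {
      c(rep(NA_character_, k), v[seq_len(len - k)])
    } else if (n > 0) {
      c(v[seq_len(len - k) + k], rep(NA_character_, k))
    } else {
      v
    }
  }
  # ave() applies shift_one separately to the elements of each group
  ave(as.character(x), group, FUN = shift_one)
}

d <- data.frame(doc_id = c(1, 1, 1, 2, 2),
                pos = c("Art", "N", "V", "Pron", "V"))
d$pos_prev <- shift_by_group(d$pos, d$doc_id, -1)  # analogous to term[t-1]
d$pos_next <- shift_by_group(d$pos, d$doc_id, 1)   # analogous to term[t+1]
d
```

Because the shift is applied per group, the first row of doc_id 2 has no previous term even though a row from doc_id 1 precedes it in the data.frame.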
Usage
crf_cbind_attributes(
data,
terms,
by,
from = -2,
to = 2,
ngram_max = 3,
sep = "-"
)
Arguments
data |
a data.frame which will be coerced to a data.table (cbinding will be done by reference on the existing data.frame) |
terms |
a character vector of column names which are part of data |
by |
a character vector of column names which are part of data |
from |
integer, by default set to -2, indicating to look up to 2 terms before the current term |
to |
integer, by default set to 2, indicating to look up to 2 terms after the current term |
ngram_max |
integer indicating the maximum number of terms to combine (2 means bigrams, 3 trigrams, ...) |
sep |
character indicating how to combine the previous/next/current terms. Defaults to '-'. |
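To make the roles of ngram_max and sep concrete, here is a minimal base R sketch of how neighbouring terms might be pasted into bigrams and trigrams. The helper combine_terms is hypothetical and not part of crfsuite, and the rule that an n-gram is missing whenever one of its terms is missing is an assumption of this sketch, not documented behaviour.

```r
# Hypothetical illustration (not crfsuite code)
tokens <- c("the", "cat", "sat")
prev   <- c(NA, head(tokens, -1))   # previous term, like term[t-1]
nxt    <- c(tail(tokens, -1), NA)   # next term, like term[t+1]

# Join parallel character vectors element-wise with `sep`
combine_terms <- function(..., sep = "-") {
  parts <- list(...)
  out <- do.call(paste, c(parts, sep = sep))
  # Assumption: the n-gram is missing as soon as one of its terms is missing
  incomplete <- Reduce(`|`, lapply(parts, is.na))
  out[incomplete] <- NA_character_
  out
}

bigram  <- combine_terms(prev, tokens, sep = "|")       # 2 terms combined
trigram <- combine_terms(prev, tokens, nxt, sep = "|")  # 3 terms combined
data.frame(tokens, bigram, trigram)
```

With ngram_max = 2 only combinations like bigram would be built; ngram_max = 3 additionally allows combinations like trigram.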
Examples
## Simulated data: 1000 part-of-speech tags spread over 10 documents
x <- data.frame(doc_id = sort(sample.int(n = 10, size = 1000, replace = TRUE)))
x$pos <- sample(c("Art", "N", "Prep", "V", "Adv", "Adj", "Conj",
                  "Punc", "Num", "Pron", "Int", "Misc"),
                size = nrow(x), replace = TRUE)
## Add the previous, current and next tag within each document, plus their combinations
x <- crf_cbind_attributes(x, terms = "pos", by = "doc_id",
                          from = -1, to = 1, ngram_max = 3)
head(x)
## Example on some real data
x <- ner_download_modeldata("conll2002-nl")
x <- crf_cbind_attributes(x, terms = c("token", "pos"),
                          by = c("doc_id", "sentence_id"),
                          ngram_max = 3, sep = "|")