nametagger {nametagger} | R Documentation |
Train a Named Entity Recognition Model using NameTag
Description
Train a Named Entity Recognition Model using NameTag. Details at https://ufal.mff.cuni.cz/nametag/1.
Usage
nametagger(
  x.train,
  x.test = NULL,
  iter = 30L,
  lr = c(0.1, 0.01),
  lambda = 0.5,
  stages = 1L,
  weight_missing = -0.2,
  control = nametagger_options(token = list(window = 2)),
  type = if (inherits(control, "nametagger_options")) control$type else "generic",
  tagger = if (inherits(control, "nametagger_options")) control$tagger else "trivial",
  file = if (inherits(control, "nametagger_options")) control$file else
    "nametagger.ner"
)
Arguments
x.train |
a file with training data or a data.frame which can be passed on to |
x.test |
optionally, a file with test data or a data.frame which can be passed on to |
iter |
the number of iterations performed when training each stage of the recognizer. With more iterations, training takes longer (recognition time is unaffected), and the model becomes over-trained when too many iterations are used. Values from 10 to 30 or 50 are commonly used. |
lr |
learning rates used. Should be a vector of length 2 where
|
lambda |
the value of the Gaussian prior imposed on the weights, in other words the value of the L2-norm regularizer. Common values are either 0 for no regularization or a small real number like 0.5. |
stages |
the number of stages performed during recognition. Common values are either 1 or 2. With more stages, the model is larger and recognition is slower but more accurate. |
weight_missing |
default value of missing weights in the log-linear model. Common values are small negative real numbers like -0.2. |
control |
the result of a call to |
type |
one of 'generic', 'english' or 'czech' |
tagger |
either 'trivial' (no lemma used in the training data) or 'external' (you provided your own lemma in the training data) |
file |
path to the file where the model will be saved |
Value
an object of class nametagger containing an extra list element called stats, which holds information on the evolution of the log probability and of the accuracy on the training set and, optionally, the test set
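The stats element can help in choosing iter, since too many iterations over-train the model. Below is a minimal sketch in base R, assuming stats behaves like a data.frame with the columns iteration, logprob, accuracy_train and accuracy_test used in the Examples. The toy_stats object and its numbers are hypothetical placeholders that stand in for model$stats; they are not results from a real training run.

```r
# Hypothetical stand-in for model$stats; the values are placeholders, not real results.
toy_stats <- data.frame(
  iteration      = 1:5,
  logprob        = c(-900, -500, -300, -200, -150),
  accuracy_train = c(96.0, 97.5, 98.6, 99.2, 99.6),
  accuracy_test  = c(95.1, 96.0, 96.4, 96.3, 96.1)
)

# Iteration with the highest test accuracy (the first one in case of ties).
best_iter <- toy_stats$iteration[which.max(toy_stats$accuracy_test)]

# Gap between training and test accuracy; a gap that keeps widening
# while test accuracy stalls is a hint of over-training.
toy_stats$gap <- toy_stats$accuracy_train - toy_stats$accuracy_test

print(best_iter)
print(toy_stats[, c("iteration", "accuracy_test", "gap")])
```

With a trained model, the same two lines could be applied to model$stats instead of toy_stats, provided test data was supplied through x.test.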
Examples
## Use one document of europeananews; sentences with sentence_id <= 100 are held out as test data
data(europeananews)
x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
traindata <- subset(x, sentence_id > 100)
testdata <- subset(x, sentence_id <= 100)

## Define the feature templates and the file where the model is saved
path <- "nametagger-nl.ner"
opts <- nametagger_options(file = path,
                           token = list(window = 2),
                           token_normalisedsuffix = list(window = 0, from = 1, to = 4),
                           ner_previous = list(window = 2),
                           time = list(use = TRUE),
                           url_email = list(url = "URL", email = "EMAIL"))

## Train the model, tracking accuracy on the test data
model <- nametagger(x.train = traindata,
                    x.test = testdata,
                    iter = 30, lambda = 0.5,
                    control = opts)
model

## Inspect the training statistics
model$stats
plot(model$stats$iteration, model$stats$logprob, type = "b")
plot(model$stats$iteration, model$stats$accuracy_train, type = "b", ylim = c(95, 100))
lines(model$stats$iteration, model$stats$accuracy_test, type = "b", lty = 2, col = "red")

## Predict on new text, split into tokens on whitespace
predict(model,
        "Ik heet Karel je kan me bereiken op paul@duchanel.be of www.duchanel.be",
        split = "[[:space:]]+")

## Alternative: train from a training data file, with the features given in a file
features <- system.file(package = "nametagger",
                        "models", "features_default.txt")
cat(readLines(features), sep = "\n")
path_traindata <- "traindata.txt"
write_nametagger(x, file = path_traindata)
model <- nametagger(path_traindata, iter = 30, control = features, file = path)
model