nametagger {nametagger}    R Documentation

Train a Named Entity Recognition Model using NameTag

Description

Train a Named Entity Recognition Model using NameTag. Details at https://ufal.mff.cuni.cz/nametag/1.

Usage

nametagger(
  x.train,
  x.test = NULL,
  iter = 30L,
  lr = c(0.1, 0.01),
  lambda = 0.5,
  stages = 1L,
  weight_missing = -0.2,
  control = nametagger_options(token = list(window = 2)),
  type = if (inherits(control, "nametagger_options")) control$type else "generic",
  tagger = if (inherits(control, "nametagger_options")) control$tagger else "trivial",
  file = if (inherits(control, "nametagger_options")) control$file else
    "nametagger.ner"
)
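
The last three defaults depend on what is passed as control. The following base-R sketch mirrors those default expressions to show which values apply when control is an options object and when it is a path to a features file. The helper resolve_defaults and the hand-built stand-in object are hypothetical and only illustrate the pattern; they are not part of the nametagger package.

```r
# Hypothetical helper mirroring the default expressions in the usage above;
# it is not part of the nametagger package.
resolve_defaults <- function(
    control,
    type   = if (inherits(control, "nametagger_options")) control$type else "generic",
    tagger = if (inherits(control, "nametagger_options")) control$tagger else "trivial",
    file   = if (inherits(control, "nametagger_options")) control$file else "nametagger.ner") {
  # Defaults are evaluated lazily, so they can refer to the control argument.
  list(type = type, tagger = tagger, file = file)
}

# Stand-in for an options object: a plain list carrying the class attribute.
# The field values are made up for illustration.
opts <- structure(
  list(type = "english", tagger = "external", file = "my-model.ner"),
  class = "nametagger_options"
)

from_options <- resolve_defaults(opts)                    # values taken from the object
from_path    <- resolve_defaults("features_default.txt")  # falls back to the literals
str(from_options)
str(from_path)
```

Under this pattern, passing a file path as control means type, tagger and file fall back to "generic", "trivial" and "nametagger.ner" unless they are given explicitly.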

Arguments

x.train

a file with training data or a data.frame which can be passed on to write_nametagger

x.test

optionally, a file with test data or a data.frame which can be passed on to write_nametagger

iter

the number of iterations performed when training each stage of the recognizer. With more iterations, training takes longer (recognition time is unaffected), and the model becomes over-trained when too many iterations are used. Values of 10 to 30, or up to 50, are commonly used.

lr

learning rates used. Should be a numeric vector of length 2, where

  • element 1: learning rate used in the first iteration of the SGD training of the log-linear model. A common value is 0.1.

  • element 2: learning rate used in the last iteration of the SGD training of the log-linear model. Common values range from 0.1 down to 0.001, with 0.01 working reasonably well.
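
How the rate moves between these two values is not stated here. As an illustration only, the sketch below assumes a geometric (log-linear) interpolation from element 1 to element 2 over iter iterations. The helper lr_schedule is hypothetical, and the schedule actually used by NameTag may differ.

```r
# Hypothetical helper: assumes geometric decay between the two learning rates.
# This is an assumption for illustration, not NameTag's documented behaviour.
lr_schedule <- function(lr = c(0.1, 0.01), iter = 30L) {
  stopifnot(length(lr) == 2, all(lr > 0), iter >= 1)
  if (iter == 1) return(lr[1])
  # Interpolate linearly in log space, so the rate shrinks by a constant
  # factor from one iteration to the next.
  exp(seq(log(lr[1]), log(lr[2]), length.out = iter))
}

rates <- lr_schedule(c(0.1, 0.01), iter = 5L)
round(rates, 4)
```

The first and last entries of rates equal the two elements of lr; the values in between only show one plausible way of moving from one to the other.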

lambda

the value of the Gaussian prior imposed on the weights, in other words the strength of the L2-norm regularizer. Common values are 0 for no regularization or a small real number like 0.5.

stages

the number of stages performed during recognition. Common values are either 1 or 2. With more stages, the model is larger and recognition is slower, but more accurate.

weight_missing

default value of missing weights in the log-linear model. Common values are small negative real numbers like -0.2.

control

the result of a call to nametagger_options, or the path to a file with feature transformations serving as predictive elements in the model

type

one of 'generic', 'english' or 'czech'

tagger

either 'trivial' (no lemma used in the training data) or 'external' (you provided your own lemma in the training data)

file

path to the file where the model will be saved

Value

an object of class nametagger containing an extra list element called stats, which holds information on the evolution of the log probability and of the accuracy on the training set and, optionally, the test set
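
Because too many iterations can over-train the model, the stats element can help in choosing a value for iter. The sketch below works on a hand-made data.frame that only borrows the column names used in the examples (iteration, logprob, accuracy_train, accuracy_test). The numbers are invented for illustration and are not real training output.

```r
# Invented numbers standing in for model$stats; not real training output.
stats <- data.frame(
  iteration      = 1:6,
  logprob        = c(-9000, -5200, -3100, -2000, -1400, -1000),
  accuracy_train = c(95.1, 96.8, 97.9, 98.6, 99.1, 99.4),
  accuracy_test  = c(94.0, 95.2, 95.9, 96.1, 96.0, 95.8)
)

# Iteration with the highest test accuracy (which.max returns the first
# maximum if there are ties).
best <- stats$iteration[which.max(stats$accuracy_test)]
best

# From this iteration on, training accuracy keeps rising while test accuracy
# falls in the invented data, which is the pattern expected under over-training.
stats[stats$iteration >= best, c("iteration", "accuracy_train", "accuracy_test")]
```

With a fitted model, the same lines could be applied to model$stats, provided a test set was passed as x.test.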

Examples

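## Use one Dutch document of the example data; hold out the first 100 sentences as test set
data(europeananews)
x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
traindata <- subset(x, sentence_id >  100)
testdata  <- subset(x, sentence_id <= 100)
path <- "nametagger-nl.ner"

## Define the feature templates and train on the data.frame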
opts <- nametagger_options(file = path,
                           token = list(window = 2),
                           token_normalisedsuffix = list(window = 0, from = 1, to = 4),
                           ner_previous = list(window = 2),
                           time = list(use = TRUE),
                           url_email = list(url = "URL", email = "EMAIL"))


model <- nametagger(x.train = traindata, 
                    x.test = testdata,
                    iter = 30, lambda = 0.5,
                    control = opts)

model
model$stats
plot(model$stats$iteration, model$stats$logprob, type = "b")
plot(model$stats$iteration, model$stats$accuracy_train, type = "b", ylim = c(95, 100))
lines(model$stats$iteration, model$stats$accuracy_test, type = "b", lty = 2, col = "red")

predict(model, 
        "Ik heet Karel je kan me bereiken op paul@duchanel.be of www.duchanel.be", 
        split = "[[:space:]]+")


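## Alternatively, train from a training data file and a file with feature templates
features <- system.file(package = "nametagger",
                        "models", "features_default.txt")
cat(readLines(features), sep = "\n")
path_traindata <- "traindata.txt"

## Write the annotated data to disk in the format expected by nametagger
write_nametagger(x, file = path_traindata)

## control is now the path to the features file instead of a nametagger_options object
model <- nametagger(path_traindata, iter = 30, control = features, file = path)
model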




[Package nametagger version 0.1.3 Index]