nametagger {nametagger}

R Documentation

Train a Named Entity Recognition Model using NameTag

Description

Train a Named Entity Recognition Model using NameTag. Details at https://ufal.mff.cuni.cz/nametag/1.

Usage

nametagger(
  x.train,
  x.test = NULL,
  iter = 30L,
  lr = c(0.1, 0.01),
  lambda = 0.5,
  stages = 1L,
  weight_missing = -0.2,
  control = nametagger_options(token = list(window = 2)),
  type = if (inherits(control, "nametagger_options")) control$type else "generic",
  tagger = if (inherits(control, "nametagger_options")) control$tagger else "trivial",
  file = if (inherits(control, "nametagger_options")) control$file else
    "nametagger.ner"
)

Arguments

`x.train`	a file with training data or a data.frame which can be passed on to `write_nametagger`
`x.test`	optionally, a file with test data or a data.frame which can be passed on to `write_nametagger`
`iter`	the number of iterations performed when training each stage of the recognizer. With more iterations, training takes longer (recognition time is unaffected), and the model becomes over-trained when too many iterations are used. Values from 10 to 30, or up to 50, are commonly used.
`lr`	learning rates used by the SGD training of the log-linear model. Should be a vector of length 2: element 1 is the learning rate used in the first iteration (a common value is 0.1); element 2 is the learning rate used in the last iteration (common values range from 0.1 to 0.001, with 0.01 working reasonably well).
`lambda`	the value of the Gaussian prior imposed on the weights, in other words the strength of the L2-norm regularizer. Common values are 0 for no regularization or a small real number such as 0.5.
`stages`	the number of stages performed during recognition. Common values are either 1 or 2. With more stages, the model is larger and recognition is slower, but more accurate.
`weight_missing`	default value of missing weights in the log-linear model. Common values are small negative real numbers like -0.2.
`control`	the result of a call to `nametagger_options`, or a file with predictive feature transformations serving as predictive elements in the model
`type`	one of 'generic', 'english' or 'czech'
`tagger`	one of 'trivial' (no lemma used in the training data) or 'external' (you provide your own lemma in the training data)
`file`	path to the file where the model will be saved
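
The `lr` argument only fixes the learning rate of the first and of the last iteration; how the rate moves between the two is not stated on this page. As a rough illustration, the sketch below assumes a geometric (log-linear) interpolation between the two values. The helper `lr_schedule` is hypothetical and is not part of the nametagger package, and the schedule NameTag actually uses may differ.

```r
# Hypothetical helper, not part of nametagger: interpolate geometrically
# between the first and the last learning rate (an ASSUMED schedule).
lr_schedule <- function(lr, iter) {
  stopifnot(length(lr) == 2, all(lr > 0), iter >= 1)
  if (iter == 1) return(lr[1])
  # position of each iteration: 0 at the first iteration, 1 at the last
  step <- (seq_len(iter) - 1) / (iter - 1)
  exp((1 - step) * log(lr[1]) + step * log(lr[2]))
}

# Same values as the defaults shown under Usage
rates <- lr_schedule(lr = c(0.1, 0.01), iter = 30L)
head(rates, 3)
tail(rates, 1)
```

Under this assumption the rate shrinks by a constant factor per iteration, so more iterations with the same `lr` simply give a finer-grained decay between the same two endpoints.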

Value

an object of class nametagger with an extra list element called stats, which records the evolution of the log probability and of the accuracy on the training set and, if `x.test` was provided, on the test set
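
Since too many iterations over-train the model, the stats element can be used to check where test accuracy stops improving. The sketch below works on a hypothetical stand-in for `model$stats`: the column names are the ones used in the Examples on this page, but the numbers are made up for illustration and the real element may carry further fields.

```r
# Hypothetical stand-in for model$stats; values are invented for illustration
stats <- data.frame(
  iteration      = 1:5,
  logprob        = c(-9000, -5200, -3900, -3300, -3000),
  accuracy_train = c(95.1, 97.0, 98.2, 98.9, 99.3),
  accuracy_test  = c(94.8, 96.1, 96.6, 96.5, 96.3)
)

# Iteration at which accuracy on the test set peaks
best <- stats$iteration[which.max(stats$accuracy_test)]

# A widening gap between training and test accuracy hints at over-training
stats$gap <- stats$accuracy_train - stats$accuracy_test

best
stats[, c("iteration", "gap")]
```

If test accuracy peaks well before the last iteration, retraining with a smaller `iter` is one option to consider.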

Examples

data(europeananews)
# Keep one document; sentences 1 to 100 form the test set, the rest the training set
x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
traindata <- subset(x, sentence_id >  100)
testdata  <- subset(x, sentence_id <= 100)
path <- "nametagger-nl.ner"

opts <- nametagger_options(file = path,
                           token = list(window = 2),
                           token_normalisedsuffix = list(window = 0, from = 1, to = 4),
                           ner_previous = list(window = 2),
                           time = list(use = TRUE),
                           url_email = list(url = "URL", email = "EMAIL"))


model <- nametagger(x.train = traindata, 
                    x.test = testdata,
                    iter = 30, lambda = 0.5,
                    control = opts)

model
model$stats
plot(model$stats$iteration, model$stats$logprob, type = "b")
plot(model$stats$iteration, model$stats$accuracy_train, type = "b", ylim = c(95, 100))
lines(model$stats$iteration, model$stats$accuracy_test, type = "b", lty = 2, col = "red")

predict(model, 
        "Ik heet Karel je kan me bereiken op paul@duchanel.be of www.duchanel.be", 
        split = "[[:space:]]+")


# Alternative: train from a file on disk, passing a feature file as control
features <- system.file(package = "nametagger", 
                        "models", "features_default.txt")
cat(readLines(features), sep = "\n")

# Write the annotated data to a training file
path_traindata <- "traindata.txt"
write_nametagger(x, file = path_traindata)

model <- nametagger(path_traindata, iter = 30, control = features, file = path)
model

[Package nametagger version 0.1.3 Index]