udpipe_train {udpipe} | R Documentation
Train a UDPipe model
Description
Train a UDPipe model that performs
tokenization, part-of-speech tagging, lemmatization and dependency parsing, or any combination of these.
This function builds models from data in CONLL-U format,
as described at https://universaldependencies.org/format.html. At the time of writing, open data in CONLL-U
format is available for more than 50 languages at https://universaldependencies.org.
Most of these datasets are distributed under the CC-BY-SA license or the CC-BY-NC-SA license.
Training on such data gives you your own annotation model.
This is relevant if you want to tune the tagger to your needs
or if you do not want to use the ready-made models provided under the CC-BY-NC-SA license, as shown at udpipe_load_model.
Usage
udpipe_train(
file = file.path(getwd(), "my_annotator.udpipe"),
files_conllu_training,
files_conllu_holdout = character(),
annotation_tokenizer = "default",
annotation_tagger = "default",
annotation_parser = "default"
)
Arguments
file |
full path where the model will be saved. The model will be stored as a binary file which |
files_conllu_training |
a character vector of files in CONLL-U format used for training the model |
files_conllu_holdout |
a character vector of files in CONLL-U format used for holdout evaluation of the model. This argument is optional. |
annotation_tokenizer |
a string containing options for the tokenizer. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette |
annotation_tagger |
a string containing options for the POS tagger and lemmatiser. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette |
annotation_parser |
a string containing options for the dependency parser. This can be either 'none' or 'default' or a list
of options as mentioned in the UDPipe manual. See the vignette |
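As a rough illustration of how a list of options relates to an option string, the sketch below collapses a named list into name=value pairs separated by semicolons. The helper options_to_string is hypothetical and not part of udpipe, and the name=value;name=value layout is an assumption about how the UDPipe manual writes option strings; check the manual before relying on it.

```r
# Hypothetical helper (not part of udpipe): turn a named list of options
# into a single "name=value;name=value" string.
options_to_string <- function(opts) {
  # Strings such as "default" or "none" are passed through unchanged
  if (is.character(opts)) {
    return(opts)
  }
  values <- vapply(opts, as.character, character(1))
  paste(names(opts), values, sep = "=", collapse = ";")
}

tagger_opts <- list(iterations = 1, models = 1, provide_lemma = 0)
tagger_string <- options_to_string(tagger_opts)
print(tagger_string)
print(options_to_string("default"))
```

In the Examples on this page the options are passed directly as R lists, so a conversion like this is only useful for understanding or logging the settings.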
Details
In order to train a model, you need to provide files which are in CONLL-U format in argument files_conllu_training.
This can be a vector of files or just one file. If you do not have your own CONLL-U files, you can download files for your language of
choice at https://universaldependencies.org.
At the time of writing, open data in CONLL-U format is available for 50 languages at https://universaldependencies.org, namely: ancient_greek, arabic, basque, belarusian, bulgarian, catalan, chinese, coptic, croatian, czech, danish, dutch, english, estonian, finnish, french, galician, german, gothic, greek, hebrew, hindi, hungarian, indonesian, irish, italian, japanese, kazakh, korean, latin, latvian, lithuanian, norwegian, old_church_slavonic, persian, polish, portuguese, romanian, russian, sanskrit, slovak, slovenian, spanish, swedish, tamil, turkish, ukrainian, urdu, uyghur, vietnamese.
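Before passing files to files_conllu_training, it can help to check that they look like CONLL-U. The sketch below is a rough sanity check only, not a validator: it relies on the CONLL-U convention that comment lines start with "#", sentences are separated by blank lines, and every token line has 10 tab-separated fields. The helper looks_like_conllu is hypothetical and not part of udpipe.

```r
# Hypothetical helper (not part of udpipe): rough sanity check of CONLL-U lines.
looks_like_conllu <- function(lines) {
  # Drop comment lines (starting with "#") and blank sentence separators
  token_lines <- lines[!grepl("^#", lines) & nzchar(trimws(lines))]
  if (length(token_lines) == 0) {
    return(FALSE)
  }
  # Every remaining line should have exactly 10 tab-separated fields
  n_fields <- lengths(strsplit(token_lines, "\t", fixed = TRUE))
  all(n_fields == 10)
}

# Small in-memory example; for a real file use readLines(path) instead
example_lines <- c(
  "# text = Hello world",
  "1\tHello\thello\tINTJ\tUH\t_\t0\troot\t_\t_",
  "2\tworld\tworld\tNOUN\tNN\tNumber=Sing\t1\tvocative\t_\t_",
  ""
)
print(looks_like_conllu(example_lines))
```

A check like this catches files that were saved with spaces instead of tabs or that are not CONLL-U at all; it does not verify that the annotations themselves are valid.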
Value
A list with elements
file: The path to the model, which can be used in udpipe_load_model
annotation_tokenizer: The input argument annotation_tokenizer
annotation_tagger: The input argument annotation_tagger
annotation_parser: The input argument annotation_parser
errors: Messages from the UDPipe process indicating possible errors, for example when passing the wrong arguments to annotation_tokenizer, annotation_tagger or annotation_parser
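The returned list can be inspected before the model is used. The following is a minimal sketch under stated assumptions: check_training_result is a hypothetical helper, not part of udpipe, and it assumes that an empty or blank errors element means the UDPipe process reported nothing. A mock list stands in for the result of a real udpipe_train call so that the snippet runs without training a model.

```r
# Hypothetical helper (not part of udpipe): summarise the list returned by udpipe_train().
check_training_result <- function(m) {
  stopifnot(is.list(m), "file" %in% names(m))
  list(
    # TRUE if a model file exists at the returned path
    model_written = file.exists(m$file),
    # Assumption: an empty or blank 'errors' element means nothing was reported
    has_errors = length(m$errors) > 0 && any(nzchar(trimws(m$errors)))
  )
}

# Mock return value standing in for a real call to udpipe_train();
# tempfile() only generates a path, so no file exists there
mock_result <- list(
  file = tempfile(fileext = ".udpipe"),
  annotation_tokenizer = "default",
  annotation_tagger = "none",
  annotation_parser = "none",
  errors = ""
)
res <- check_training_result(mock_result)
print(res)
```

With a real result m, a model that was written without reported errors can then be loaded with udpipe_load_model(m$file).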
References
https://ufal.mff.cuni.cz/udpipe/1/users-manual
See Also
udpipe_annotation_params, udpipe_annotate, udpipe_load_model, udpipe_accuracy
Examples
## You need a file on disk in CONLL-U format; here we take the toy example file shipped with the package
file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
file_conllu
cat(head(readLines(file_conllu), 3), sep="\n")
## Not run:
##
## This is a toy example showing how to build a model; it is not a good model whatsoever.
## Because model building takes more than 5 seconds, this model is also saved in
## the file at system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
##
m <- udpipe_train(file = "toymodel.udpipe", files_conllu_training = file_conllu,
  annotation_tokenizer = list(dimension = 16, epochs = 1, batch_size = 100, dropout = 0.7),
  annotation_tagger = list(iterations = 1, models = 1,
    provide_xpostag = 1, provide_lemma = 0, provide_feats = 0,
    guesser_suffix_rules = 2, guesser_prefix_min_count = 2),
  annotation_parser = list(iterations = 2,
    embedding_upostag = 20, embedding_feats = 20, embedding_xpostag = 0, embedding_form = 50,
    embedding_lemma = 0, embedding_deprel = 20, learning_rate = 0.01,
    learning_rate_final = 0.001, l2 = 0.5, hidden_layer = 200,
    batch_size = 10, transition_system = "projective", transition_oracle = "dynamic",
    structured_interval = 10))
## End(Not run)
file_model <- system.file(package = "udpipe", "dummydata", "toymodel.udpipe")
ud_toymodel <- udpipe_load_model(file_model)
x <- udpipe_annotate(object = ud_toymodel, x = "Ik ging deze morgen naar de bakker brood halen.")
x <- as.data.frame(x)
##
## The above was a toy example showing how to build a model. For real-life scenarios,
## look at the training parameter examples given below and train the model on your own CONLL-U file
##
## Example training arguments used for the models available at udpipe_download_model
data(udpipe_annotation_params)
head(udpipe_annotation_params$tokenizer)
head(udpipe_annotation_params$tagger)
head(udpipe_annotation_params$parser)
## Not run:
## More details in the package vignette:
vignette("udpipe-train", package = "udpipe")
## End(Not run)