R: Convert a data.frame to CONLL-U format

as_conllu {udpipe}

R Documentation

Convert a data.frame to CONLL-U format

Description

If you have a data.frame with annotations containing 1 row per token, you can convert it to CONLL-U format with this function. The data frame is required to have the following columns: doc_id, sentence_id, sentence, token_id, token and optionally has the following columns: lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc. Where these fields have the following meaning

doc_id: the identifier of the document
sentence_id: the identifier of the sentence
sentence: the text of the sentence for which this token is part of
token_id: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes.
token: Word form or punctuation symbol.
lemma: Lemma or stem of word form.
upos: Universal part-of-speech tag.
xpos: Language-specific part-of-speech tag; underscore if not available.
feats: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
head_token_id: Head of the current word, which is either a value of token_id or zero (0).
dep_rel: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
deps: Enhanced dependency graph in the form of a list of head-deprel pairs.
misc: Any other annotation.

The tokens in the data.frame should be ordered as they appear in the sentence.

Usage

as_conllu(x)

Arguments

`x`	a data.frame with columns doc_id, sentence_id, sentence, token_id, token, lemma, upos, xpos, feats, head_token_id, deprel, dep_rel, misc

Value

a character string of length 1 containing the data.frame in CONLL-U format. See the example. You can easily save this to disk for processing in other applications.

References

https://universaldependencies.org/format.html

Examples

file_conllu <- system.file(package = "udpipe", "dummydata", "traindata.conllu")
x <- udpipe_read_conllu(file_conllu)
str(x)
conllu <- as_conllu(x)
cat(conllu)
## Not run: 
## Write it to file, making sure it is in UTF-8
cat(as_conllu(x), file = file("annotations.conllu", encoding = "UTF-8"))

## End(Not run)

## Some fields are not mandatory, they will assummed to be NA
conllu <- as_conllu(x[, c('doc_id', 'sentence_id', 'sentence', 
                          'token_id', 'token', 'upos')])
cat(conllu)

[Package udpipe version 0.8.11 Index]