read.emeld {interlineaR} | R Documentation |
Read an EMELD XML document containing an interlinearized corpus.
Description
The EMELD XML vocabulary has been proposed for the encoding of interlinear glosses. It is used by the FieldWorks software (SIL FLEX) as an export format.
Usage
read.emeld(file, vernacular.languages, analysis.languages = "en",
get.morphemes = TRUE, get.words = TRUE, get.sentences = TRUE,
get.texts = TRUE, text.fields = c("title", "title-abbreviation", "source",
"comment"), sentence.fields = c("segnum", "gls", "lit", "note"),
words.vernacular.fields = "txt", words.analysis.fields = c("gls", "pos"),
morphemes.vernacular.fields = c("txt", "cf"),
morphemes.analysis.fields = c("gls", "msa", "hn"), sep = ";")
Arguments
file |
the path (or URL) to a document in the EMELD vocabulary |
vernacular.languages |
character vector: one or more codes of languages analysed in the document. |
analysis.languages |
character vector: one or more codes of languages used for the analyses (in glosses, translations, notes) in the document. |
get.morphemes |
logical vector: should the returned list include a slot for the description of morphemes? |
get.words |
logical vector: should the returned list include a slot for the description of words? |
get.sentences |
logical vector: should the returned list include a slot for the description of sentences? |
get.texts |
logical vector: should the returned list include a slot for the description of texts? |
text.fields |
character vector: information to be extracted for the texts (and turned into the corresponding columns in the data.frame describing texts). The defaults are "title", "title-abbreviation", "source" and "comment".
|
sentence.fields |
character vector: information to be extracted for the sentences (and turned into the corresponding columns in the data.frame describing sentences). The defaults are "segnum", "gls", "lit" and "note".
|
words.vernacular.fields |
character vector: information (in vernacular language(s)) to be extracted for the words (and turned into the corresponding columns in the data.frame describing words). The default is "txt".
|
words.analysis.fields |
character vector: information (in analysis language(s)) to be extracted for the words (and turned into the corresponding columns in the data.frame describing words). The defaults are "gls" and "pos".
|
morphemes.vernacular.fields |
character vector: information (in vernacular language(s)) to be extracted for the morphemes (and turned into the corresponding columns in the data.frame describing morphemes). May be NULL or empty.
|
morphemes.analysis.fields |
character vector: information (in analysis language(s)) to be extracted for the morphemes (and turned into the corresponding columns in the data.frame describing morphemes). May be NULL or empty.
|
sep |
character vector: the character used to join multiple notes in the same language. |
Details
If a sentence contains several 'note' fields in the same language, they are concatenated into a single value, joined by the string given in the "sep" argument.
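The effect of the "sep" argument can be pictured with base R alone. The snippet below is only a sketch of the documented behaviour, not the package's internal code: the two notes are invented, and joining them with paste() is an assumption about how the concatenation is done.

```r
# Hypothetical notes attached to one sentence, both in the same language ("en").
notes <- c("elicited", "speaker hesitates")

# Default value of the 'sep' argument of read.emeld.
sep <- ";"

# Assumed behaviour: the notes are joined into one string for the note.en column.
note.en <- paste(notes, collapse = sep)
print(note.en)
```

A separator that cannot occur inside a note makes it possible to split the value again later with strsplit(note.en, sep, fixed = TRUE).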
Value
a list with slots named "morphemes", "words", "sentences" and "texts" (some slots may have been excluded through the "get.*" arguments, see above). Each slot is a data.frame containing the information on the corresponding unit. In each data.frame, each row describes one occurrence (the first row of the result$morphemes data.frame describes the first morpheme of the corpus). In each data.frame, the first columns give ids referring to the rows of the other data.frames (so that the first morpheme can be linked to the text, the sentence or the word it belongs to). The following columns give information about the corresponding occurrence of the unit. Which pieces of information are extracted from the document and included in the data.frame depends upon the *.fields parameters (see above). Column names are built from the field name and the language code. For instance, if read.emeld is called with the parameters vernacular.languages="tww" and morphemes.vernacular.fields=c("txt", "cf"), then the columns txt.tww and cf.tww will be created in the morphemes data.frame.
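The column naming rule can be reproduced in base R to predict which columns to expect. This is a sketch under the assumption that the field name and the language code are joined with a dot, as in the txt.tww example; it does not call the package, and the variable column.names is introduced here only for illustration.

```r
# Argument values as they would be passed to read.emeld.
vernacular.languages <- "tww"
morphemes.vernacular.fields <- c("txt", "cf")

# Assumed naming rule: <field>.<language code>
column.names <- paste(morphemes.vernacular.fields, vernacular.languages, sep = ".")
print(column.names)
```

With a single language code this gives txt.tww and cf.tww, the two columns named in the description of the morphemes data.frame.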
References
Baden Hughes, Steven Bird and Catherine Bow. Encoding and Presenting Interlinear Text Using XML Technologies. http://www.aclweb.org/anthology/U03-1008
SIL FieldWorks: https://software.sil.org/fieldworks/
Examples
path <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
corpus <- read.emeld(path, vernacular.languages="tww", analysis.languages="en")
head(corpus$morphemes)
# In some cases, one may have to combine information coming from several data frames.
# Let's imagine one needs to have in the same data frame the morphemes data
# plus the "note" field attached to sentences:
# - The easy way is to combine all the columns of the two data frames 'morphemes' and 'sentences':
combined <- merge(corpus$morphemes, corpus$sentences, by="sentence_id")
head(combined)
# - Alternatively, one may use vector extraction in order to add only the desired column
# to the morphemes data frame:
corpus$morphemes$note <- corpus$sentences$note.en[corpus$morphemes$sentence_id]
head(corpus$morphemes)
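The vector extraction above relies on sentence_id being usable as a row index into the sentences data frame. If that cannot be taken for granted (for instance after filtering or reordering the sentences), a lookup with match() is a more defensive variant. The sketch below uses two small invented data frames as stand-ins for corpus$sentences and corpus$morphemes; only the column names sentence_id, note.en and txt.tww follow the conventions described on this page.

```r
# Toy stand-ins for corpus$sentences and corpus$morphemes (invented data).
sentences <- data.frame(sentence_id = c(1, 2),
                        note.en = c("first note", NA),
                        stringsAsFactors = FALSE)
morphemes <- data.frame(sentence_id = c(1, 1, 2),
                        txt.tww = c("a", "b", "c"),
                        stringsAsFactors = FALSE)

# Find, for each morpheme, the row of its sentence by id rather than by position.
idx <- match(morphemes$sentence_id, sentences$sentence_id)

# Copy only the desired column into the morphemes data frame.
morphemes$note <- sentences$note.en[idx]
print(morphemes)
```

Morphemes whose sentence has no note, or whose id has no matching row, receive NA in the new column.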