R: Read an EMELD XML document containing an interlinearized...

read.emeld {interlineaR}

R Documentation

Read an EMELD XML document containing an interlinearized corpus.

Description

The EMELD XML vocabulary has been proposed for the encoding of interlinear glosses. It is used by the FieldWorks software (SIL FLEX) as an export format.

Usage

read.emeld(file, vernacular.languages, analysis.languages = "en",
  get.morphemes = TRUE, get.words = TRUE, get.sentences = TRUE,
  get.texts = TRUE, text.fields = c("title", "title-abbreviation", "source",
  "comment"), sentence.fields = c("segnum", "gls", "lit", "note"),
  words.vernacular.fields = "txt", words.analysis.fields = c("gls", "pos"),
  morphemes.vernacular.fields = c("txt", "cf"),
  morphemes.analysis.fields = c("gls", "msa", "hn"), sep = ";")

Arguments

`file`	the path (or URL) to a document in the EMELD vocabulary
`vernacular.languages`	character vector: one or more codes of languages analysed in the document.
`analysis.languages`	character vector: one or more codes of languages used for the analyses (in glosses, translations, notes) in the document.
`get.morphemes`	logical vector: should the returned list include a slot for the description of morphemes?
`get.words`	logical vector: should the returned list include a slot for the description of words?
`get.sentences`	logical vector: should the returned list include a slot for the description of sentences?
`get.texts`	logical vector: should the returned list include a slot for the description of texts?
`text.fields`	character vector: information to be extracted for the texts (and turned into corresponding columns in the data.frame describing texts). The defaults are: "title", "title-abbreviation", "source", "comment".
`sentence.fields`	character vector: information to be extracted for the sentences (and turned into corresponding columns in the data.frame describing sentences). The defaults are: "segnum": an ID of the sentence; "gls": a translation (possibly in all analysis languages); "lit": a literal translation (possibly in all analysis languages); "note": a note (possibly in all analysis languages).
`words.vernacular.fields`	character vector: information (in vernacular language(s)) to be extracted for the words (and turned into corresponding columns in the data.frame describing words). The default is: "txt": the original text.
`words.analysis.fields`	character vector: information (in analysis language(s)) to be extracted for the words (and turned into corresponding columns in the data.frame describing words). The defaults are: "gls": a gloss of the word; "pos": the part of speech of the word.
`morphemes.vernacular.fields`	character vector: information (in vernacular language(s)) to be extracted for the morphemes (and turned into corresponding columns in the data.frame describing morphemes). May be NULL or empty. The defaults are: "txt": the text of the morpheme; "cf": the canonical form of the morpheme.
`morphemes.analysis.fields`	character vector: information (in analysis language(s)) to be extracted for the morphemes (and turned into corresponding columns in the data.frame describing morphemes). May be NULL or empty. The defaults are: "gls": the gloss of the morpheme; "msa": the part of speech of the morpheme; "hn": a number identifying the morpheme among its homophones.
`sep`	character vector: the character used to join multiple notes in the same language.
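
As an illustration of how these arguments combine, the sketch below restricts the result to morphemes and sentences and keeps only the gloss for morphemes. It uses only the arguments listed above and the example file shipped with the package. The call is guarded so that it is skipped when interlineaR is not installed, and the exact content of the result is not asserted here.

```r
# Run the call only if the interlineaR package is available.
if (requireNamespace("interlineaR", quietly = TRUE)) {
  # Example file distributed with the package.
  path <- system.file("exampleData", "tuwariInterlinear.xml",
                      package = "interlineaR")

  corpus <- interlineaR::read.emeld(
    path,
    vernacular.languages = "tww",       # language being analysed
    analysis.languages = "en",          # language of glosses and notes
    get.words = FALSE,                  # no slot for words
    get.texts = FALSE,                  # no slot for texts
    morphemes.analysis.fields = "gls"   # keep only the gloss for morphemes
  )

  # Inspect which slots were returned.
  print(names(corpus))
}
```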

Details

If several 'note' fields in the same language are present in a sentence, they are concatenated into a single string, joined by the value of the "sep" argument.
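
The effect of this concatenation can be sketched in base R. The notes below are invented for illustration, and the use of paste() is an assumption about the behaviour, not a description of the package's internal code.

```r
# Hypothetical notes attached to one sentence, all in the same language.
notes <- c("elicited", "speaker hesitates", "check tone")
sep <- ";"  # default value of the 'sep' argument

# Join the notes into the single string that would fill the note column.
joined <- paste(notes, collapse = sep)
print(joined)

# The individual notes can be recovered afterwards, provided that
# no note contains the separator itself.
recovered <- strsplit(joined, sep, fixed = TRUE)[[1]]
print(recovered)
```

If notes are likely to contain semicolons, choosing a rarer separator (for instance sep = " | ") keeps them recoverable.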

Value

A list with slots named "morphemes", "words", "sentences" and "texts" (some slots may have been excluded through the "get.*" arguments, see above). Each slot is a data.frame containing the information on the corresponding unit.

In each data.frame, each row describes one occurrence (the first row of the result$morphemes data.frame describes the first morpheme of the corpus). The first columns give ids referring to the rows of the other data.frames, so that the first morpheme can be linked to the text, the sentence and the word it belongs to. The following columns give information about that occurrence of the unit.

Which information is extracted from the document and included in the data.frames depends upon the *.fields parameters (see above). Column names are built from the field name and the language code. For instance, if read.emeld is called with the parameters vernacular.languages="tww" and morphemes.vernacular.fields=c("txt", "cf"), then the columns txt.tww and cf.tww will be created in the morphemes data.frame.
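
The naming scheme can be sketched with a small helper. The function make_column_names is hypothetical (it is not part of interlineaR), and the order in which the package arranges columns for several languages is an assumption.

```r
# Hypothetical helper: build "<field>.<language>" column names.
make_column_names <- function(fields, languages) {
  # outer() pairs every field with every language; as.vector() flattens
  # the resulting matrix column by column, so all fields of the first
  # language come first.
  as.vector(outer(fields, languages, paste, sep = "."))
}

# Column names for one vernacular language and two fields.
cols <- make_column_names(c("txt", "cf"), "tww")
print(cols)
```

Knowing the scheme makes it possible to select columns programmatically, for example corpus$morphemes[, cols], assuming those columns exist in the result.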

References

Baden Hughes, Steven Bird and Catherine Bow. Encoding and Presenting Interlinear Text Using XML Technologies. http://www.aclweb.org/anthology/U03-1008

SIL FieldWorks: https://software.sil.org/fieldworks/

Examples

path <- system.file("exampleData", "tuwariInterlinear.xml", package="interlineaR")
# Full argument names are used here instead of relying on partial matching.
corpus <- read.emeld(path, vernacular.languages="tww", analysis.languages="en")
head(corpus$morphemes)

# In some cases, one may have to combine information coming from several data.frames.
# Let's imagine one needs to have in the same data.frame the morphemes data
# plus the "note" field attached to sentences:
# - The easy way is to combine all the columns of the two data.frames 'morphemes' and 'sentences':
combined <- merge(corpus$morphemes, corpus$sentences, by="sentence_id")
head(combined)

# - Alternatively, one may use vector extraction in order to add only the desired column
# to the morphemes data.frame (this relies on sentence_id being the row number
# of the sentence in corpus$sentences):
corpus$morphemes$note <- corpus$sentences$note.en[ corpus$morphemes$sentence_id ]
head(corpus$morphemes)
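
The vector-extraction approach can be made independent of row order by looking ids up with match(). The sketch below uses toy stand-ins for corpus$sentences and corpus$morphemes: the data are invented, and the column name morpheme_id is hypothetical.

```r
# Toy stand-ins for corpus$sentences and corpus$morphemes (invented data).
sentences <- data.frame(
  sentence_id = 1:3,
  note.en = c("first note", NA, "third note"),
  stringsAsFactors = FALSE
)
morphemes <- data.frame(
  morpheme_id = 1:5,                 # hypothetical column name
  sentence_id = c(1, 1, 2, 3, 3),
  txt.tww = c("a", "b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Find, for each morpheme, the row of 'sentences' holding its sentence.
# Matching on ids stays correct even if 'sentences' has been filtered
# or reordered, which plain positional indexing would not.
idx <- match(morphemes$sentence_id, sentences$sentence_id)
morphemes$note <- sentences$note.en[idx]
print(morphemes)
```

Morphemes whose sentence has no note (or whose sentence is missing from 'sentences') receive NA in the new column.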

[Package interlineaR version 1.0 Index]