set.lang.support {koRpus}    R Documentation

Add support for new languages

Description

You can use this function to add new languages to be used with koRpus.

Usage

set.lang.support(target, value, merge = TRUE)

Arguments

target

One of "kRp.POS.tags", "treetag", or "hyphen", depending on what support is to be added.

value

A named list that follows exactly the structure defined here for its respective target.

merge

Logical, only relevant for the "kRp.POS.tags" target. This argument controls whether value will completely replace an already present tagset definition, or merge all given tags (i.e., replace single tags with an updated definition or add new tags).
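
The difference between merging and replacing can be illustrated with a small base R sketch. It is not the koRpus implementation: merge_tags() is a hypothetical helper, and the three-column layout (tag, wclass, desc) is an assumption made for illustration.

```r
# Hypothetical helper, NOT part of koRpus: combines two tag tables by tag name.
merge_tags <- function(present, update, merge = TRUE) {
  if (!isTRUE(merge)) {
    # merge = FALSE: the new definition replaces the present one completely
    return(update)
  }
  # merge = TRUE: keep present tags that are not redefined, then add all
  # updated or new tags
  kept <- present[!present[, "tag"] %in% update[, "tag"], , drop = FALSE]
  rbind(kept, update)
}

# Assumed column layout, for illustration only
cols <- c("tag", "wclass", "desc")
present <- matrix(
  c("NN", "noun", "Noun",
    "VB", "verb", "Verb"),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, cols)
)
update <- matrix(
  c("VB", "verb", "Verb, base form",
    "JJ", "adjective", "Adjective"),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, cols)
)

merged <- merge_tags(present, update, merge = TRUE)    # NN kept, VB updated, JJ added
replaced <- merge_tags(present, update, merge = FALSE) # only VB and JJ remain
```

With merge = TRUE the untouched "NN" entry survives, whereas merge = FALSE drops it because it is not part of the new definition.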

Details

Language support in this package is designed to be extended easily. You could call it modular, although it is really more "environmental", because the definitions are stored in an internal environment.

To add full new language support, say for Xyzedish, you basically have to call this function three times (or at least twice, see hyphen section below) with different targets. If you would like to re-use this language support, you should consider making it a package.

Be it a package or a script, it should contain all three calls to this function. Each successful call fills an internal environment with the information you have defined.

The function set.lang.support() is called three times because three functions of koRpus need language support, one for each target: treetag() (target "treetag"), kRp.POS.tags() (target "kRp.POS.tags"), and hyphen() (target "hyphen"). Each target is described in its own section below.

All calls follow the same pattern: first, you name one of the three targets, and second, you provide a named list as the value for the respective target function.

"treetag"

The presets for the treetag() function essentially mirror what the shell (GNU/Linux, macOS) and batch (Windows) scripts that ship with TreeTagger define. Look for scripts called "$TREETAGGER/cmd/tree-tagger-xyzedish" and "$TREETAGGER\cmd\tree-tagger-xyzedish.bat", work out which call in one script corresponds to which call in the other, and then define them in set.lang.support("treetag") accordingly.

Have a look at the commented template in your koRpus installation directory for an elaborate example.

"kRp.POS.tags"

If Xyzedish is supported by TreeTagger, you should find a tagset definition for the language on its homepage. treetag() needs to know all POS tags that TreeTagger might return, otherwise you will get a self-explanatory error message as soon as an unknown tag appears. Note that this can still happen after you have implemented the full documented tag set: contributed TreeTagger parameter files sometimes add their own tags, e.g., for special punctuation. So please test your tag set well.

As you can see in the template file, you will also have to add a global word class and an explanation for each tag. The former is especially important for further steps like frequency analysis.

Again, please have a look at the commented template and/or existing language support files in the package sources; most of it should be almost self-explanatory.
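
As a rough orientation, a tag set definition could be prepared along the following lines. This is a minimal sketch, not a substitute for the commented template: the tags are made up for the fictional Xyzedish, and the component names (tag.class.def.words, tag.class.def.punct, tag.class.def.sentc) and column names (tag, wclass, desc) are assumptions based on existing language support files that should be checked against the template. The final call is left commented out because it requires koRpus to be installed.

```r
# Assumed column layout: tag, global word class, explanation
tag_cols <- c("tag", "wclass", "desc")

# Made-up tags for the fictional language Xyzedish
xyz_words <- matrix(
  c("NN", "noun", "Noun",
    "VB", "verb", "Verb"),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, tag_cols)
)
xyz_punct <- matrix(
  c(",", "comma", "Comma"),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, tag_cols)
)
xyz_sentc <- matrix(
  c("SENT", "fullstop", "Sentence ending punctuation"),
  ncol = 3, byrow = TRUE, dimnames = list(NULL, tag_cols)
)

# Named list keyed by the language identifier (component names are assumed)
xyz_tags <- list(
  "xyz" = list(
    tag.class.def.words = xyz_words,
    tag.class.def.punct = xyz_punct,
    tag.class.def.sentc = xyz_sentc
  )
)

# Sanity checks: no tag defined twice, no empty word class or explanation
all_tags <- do.call(rbind, unname(xyz_tags[["xyz"]]))
stopifnot(
  anyDuplicated(all_tags[, "tag"]) == 0,
  all(nzchar(all_tags))
)

# Registration, assuming the structure above matches the template:
# koRpus::set.lang.support(target = "kRp.POS.tags", value = xyz_tags)
```

Keeping words, punctuation, and sentence-ending tags in separate tables makes it easier to spot a tag that a parameter file adds but the documented tag set omits.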

"hyphen"

Using the target "hyphen" will cause a call to the equivalent of this function in the sylly package. See the documentation of its set.hyph.support function for details.

Packaging

If you would like to create a proper language support package, you should only include the "treetag" and "kRp.POS.tags" calls. The hyphenation patterns should instead be provided by a separate package called sylly.xx, which your package declares as a dependency. You can generate such a sylly package rather quickly by using the private function sylly:::sylly_langpack().

Examples

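# Build a minimal hyphenation pattern object for a fictional language.
# The matrix is filled column by column, so each row holds one pattern:
# the original pattern, its characters only, and its numbers only.
hyph_pat_yxz <- sylly::kRp_hyph_pat(
  lang = "xy",
  pattern = matrix(
    c(
      ".im5b", ".in1", ".in3d",
      ".imb", ".in", ".ind",
      "0050", "001", "0030"
    ),
    nrow=3,
    dimnames= list(
      NULL,
      c("orig", "char", "nums")
    )
  )
)
# Register the patterns; the "hyphen" target passes them on to sylly.
set.lang.support(
  target="hyphen",
  value=list("xyz"=hyph_pat_yxz)
)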

[Package koRpus version 0.13-8 Index]