set.lang.support {koRpus} | R Documentation |
Add support for new languages
Description
You can use this function to add new languages to be used with koRpus.
Usage
set.lang.support(target, value, merge = TRUE)
Arguments
target |
One of "kRp.POS.tags", "treetag", or "hyphen", depending on what support is to be added. |
value |
A named list that upholds exactly the structure defined here for its respective |
merge |
Logical, only relevant for the "kRp.POS.tags" target. This argument controls whether |
Details
Language support in this package is designed to be extended easily. You could call it modular, although it is actually more "environmental", but never mind.
To add full new language support, say for Xyzedish, you basically have to call this function three times (or at least twice, see hyphen section below) with different targets. If you would like to re-use this language support, you should consider making it a package.
Be it a package or a script, it should contain all three calls to this function. If it succeeds, it will fill an internal environment with the information you have defined.
The function set.lang.support() gets called three times because there are three functions of koRpus that need language support:
treetag() needs the preset information from its own start scripts
kRp.POS.tags() needs to learn all possible POS tags that TreeTagger uses for the given language
hyphen() needs to know which language pattern tests are available as data files (which you must also provide)
All the calls follow the same pattern: first, you name one of the three targets explained above, and second, you provide a named list as the value for the respective target function.
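The calling pattern, and the idea of filling an environment with named lists, can be illustrated with a small base R mock. This is only a sketch: lang_registry and register_support() are hypothetical names, the empty lists are placeholders, and none of this reproduces how koRpus stores its data internally.

```r
# Hypothetical stand-in for the package-internal environment; koRpus keeps its
# own private one, whose layout is not reproduced here.
lang_registry <- new.env(parent = emptyenv())

# Hypothetical helper mimicking the calling pattern of set.lang.support():
# a target name first, then a named list with one entry per language.
register_support <- function(target, value) {
  target <- match.arg(target, c("kRp.POS.tags", "treetag", "hyphen"))
  stopifnot(is.list(value), !is.null(names(value)), all(nzchar(names(value))))
  # fetch what is already stored for this target, if anything
  current <- if (exists(target, envir = lang_registry, inherits = FALSE)) {
    get(target, envir = lang_registry)
  } else {
    list()
  }
  # add or overwrite one entry per language abbreviation
  current[names(value)] <- value
  assign(target, current, envir = lang_registry)
  invisible(names(current))
}

# Placeholder values only; real calls need the structure each target expects.
register_support("treetag", list(xx = list()))
register_support("kRp.POS.tags", list(xx = list()))
register_support("hyphen", list(xx = list()))

# one entry per target that has been registered
ls(lang_registry)
```

The point of the mock is only that each call adds language entries under one target, which is why a script or package needs one call per target.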
"treetag"
The presets for the treetag() function are basically what the shell (GNU/Linux, macOS) and batch (Windows) scripts that come with TreeTagger define. Look for scripts called "$TREETAGGER/cmd/tree-tagger-xyzedish" and "$TREETAGGER\cmd\tree-tagger-xyzedish.bat", work out which call in one script corresponds to which call in the other, and then define them in set.lang.support("treetag") accordingly.
Have a look at the commented template in your koRpus installation directory for an elaborate example.
"kRp.POS.tags"
If Xyzedish is supported by TreeTagger, you should find a tagset definition for the language on its homepage. treetag() needs to know all POS tags that TreeTagger might return, otherwise you will get a self-explanatory error message as soon as an unknown tag appears. Notice that this can still happen after you have implemented the full documented tag set: sometimes the contributed TreeTagger parameter files add their own tags, e.g., for special punctuation. So please test your tag set well.
As you can see in the template file, you will also have to add a global word class and an explanation for each tag. The former is especially important for further steps like frequency analysis.
Again, please have a look at the commented template and/or existing language support files in the package sources; most of it should be almost self-explanatory.
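One way to test a tag set before relying on it is to compare the tags seen in a trial run against the tags you have defined. The following base R sketch uses made-up tags, word classes and descriptions, and the column names "tag", "wclass" and "desc" are an assumption derived from the description above; check the template file for the layout koRpus actually expects.

```r
# Hypothetical tag definitions for Xyzedish: one row per tag, with a global
# word class and an explanation (assumed column names, made-up content).
tag_defs <- matrix(
  c(
    "NN",   "noun",     "Common noun",
    "VB",   "verb",     "Verb, base form",
    "SENT", "fullstop", "Sentence ending punctuation"
  ),
  ncol = 3,
  byrow = TRUE,
  dimnames = list(NULL, c("tag", "wclass", "desc"))
)

# Tags as they might come back from a TreeTagger test run (made-up data)
observed_tags <- c("NN", "VB", "NN", "$(", "SENT")

# Every observed tag without a definition would trigger the unknown-tag error
unknown_tags <- setdiff(unique(observed_tags), tag_defs[, "tag"])
unknown_tags
```

Here the punctuation tag "$(" is reported as missing, which mirrors the case of parameter files that add tags beyond the documented set.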
"hyphen"
Using the target "hyphen" will cause a call to the equivalent of this function in the sylly package. See the documentation of its set.hyph.support function for details.
Packaging
If you would like to create a proper language support package, you should only include the "treetag" and "kRp.POS.tags" calls, and the hyphenation patterns should be loaded as a dependency to a package called sylly.xx. You can generate such a sylly package rather quickly by using the private function sylly:::sylly_langpack().
Examples
hyph_pat_yxz <- sylly::kRp_hyph_pat(
  lang = "xy",
  pattern = matrix(
    c(
      ".im5b", ".in1", ".in3d",
      ".imb", ".in", ".ind",
      "0050", "001", "0030"
    ),
    nrow=3,
    dimnames=list(
      NULL,
      c("orig", "char", "nums")
    )
  )
)
set.lang.support(
  target="hyphen",
  value=list("xyz"=hyph_pat_yxz)
)
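The layout of the pattern matrix in the example above is easier to see when it is built on its own. The sketch below uses base R only and does not call sylly. The two relations it checks are observations about these three sample patterns, not a documented rule of the sylly pattern format.

```r
# Same nine strings as in the example above, without calling sylly
pattern <- matrix(
  c(
    ".im5b", ".in1", ".in3d",
    ".imb", ".in", ".ind",
    "0050", "001", "0030"
  ),
  nrow = 3,
  dimnames = list(NULL, c("orig", "char", "nums"))
)

# matrix() fills column by column, so each row describes one pattern
pattern[1, ]

# In these three patterns, "char" is "orig" with the digits removed ...
all(gsub("[0-9]", "", pattern[, "orig"]) == pattern[, "char"])

# ... and "nums" has one digit per character of "char"
all(nchar(pattern[, "nums"]) == nchar(pattern[, "char"]))
```

Because the values are given column by column, the first three strings form the "orig" column, the next three "char", and the last three "nums".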