ECIMCI_profiles {textcat} | R Documentation
ECI/MCI N-Gram Profiles
Description
N-gram profile db for 26 languages based on the European Corpus Initiative Multilingual Corpus I.
Usage
ECIMCI_profiles
Details
This profile db was built by Johannes Rauch, using the ECI/MCI corpus (http://www.elsnet.org/eci.html) and the default options employed by package textcat, with all text documents encoded in UTF-8.
The category ids used for the db are the respective IETF language tags (see language in package NLP), using the ISO 639-2 Part B language subtags and, for Serbian, the script employed (i.e., "scc-Cyrl" and "scc-Latn" for Serbian written in Cyrillic and Latin script, respectively; all other languages in the profile db are written in Latin script).
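As a rough sketch of how these ids can be used, the entries carrying a script subtag can be picked out by name and the db restricted to them. This assumes that profile dbs support subscripting by id via `[`; the variable names are illustrative only.

```r
library(textcat)

## All category ids (IETF language tags) in the db.
ids <- names(ECIMCI_profiles)

## Ids ending in a script subtag; per the description above these should
## be the two Serbian entries.
with_script <- grep("-(Cyrl|Latn)$", ids, value = TRUE)
with_script

## Restrict the db to those entries (assumes a `[` method for profile dbs).
serbian_profiles <- ECIMCI_profiles[with_script]
names(serbian_profiles)
```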
References
S. Armstrong-Warwick, H. S. Thompson, D. McKelvie and D. Petitpierre (1994), Data in Your Language: The ECI Multilingual Corpus 1. In “Proceedings of the International Workshop on Sharable Natural Language Resources” (Nara, Japan), 97–106. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.44.950
Examples
## Languages in the ECI/MCI profile db:
names(ECIMCI_profiles)
## Key options used for the profile:
attr(ECIMCI_profiles, "options")[c("n", "size", "reduce", "useBytes")]
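A minimal sketch of putting the db to work: passing it as the `p` argument of `textcat()` categorizes texts against the ECI/MCI profiles rather than the package default. The sample sentences are made up for illustration, and very short texts may well be misclassified, so no particular result is implied.

```r
library(textcat)

## Two short sample texts (invented for this illustration).
x <- c("This is a short sentence written in English.",
       "Dies ist ein kurzer Satz in deutscher Sprache.")

## Categorize against the ECI/MCI profiles instead of the default db;
## the result holds one category id (or NA) per input text.
guess <- textcat(x, p = ECIMCI_profiles)
guess
```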