bidictionary {word.alignment} | R Documentation |
Building an Automatic Bilingual Dictionary
Description
Builds an automatic bilingual dictionary for two languages from a given sentence-aligned parallel corpus.
Usage
bidictionary (..., n = -1L, iter = 15, prob = 0.8,
dtfile.path = NULL, name.sorc = 'f', name.trgt = 'e')
Arguments
... |
Further arguments to be passed to |
n |
Number of sentences to be read. |
iter |
the number of iterations for IBM Model 1. |
prob |
the minimum word translation probability. |
dtfile.path |
If a specific file name is set, that file is read and the function continues with the remaining step, i.e. finding the dictionary of the two given languages. |
name.sorc |
source language's name in mydictionary. |
name.trgt |
target language's name in mydictionary. |
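To make the roles of iter and prob more concrete, the following is a minimal sketch of IBM Model 1 training followed by a minimum-probability filter. It is not the package's implementation: the function ibm1_sketch, the toy corpus, and the variable names are all hypothetical, and the sketch omits the NULL word and any text preprocessing.

```r
# Toy sentence-aligned corpus (hypothetical data, not taken from Mizan or Europarl).
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
trg <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: plain IBM Model 1 EM, without a NULL word.
ibm1_sketch <- function(src, trg, iter = 15) {
  f_words <- unique(unlist(src))
  e_words <- unique(unlist(trg))
  # tp[f, e]: probability that source word f translates to target word e,
  # initialised uniformly.
  tp <- matrix(1 / length(e_words), length(f_words), length(e_words),
               dimnames = list(f_words, e_words))
  for (k in seq_len(iter)) {
    counts <- tp * 0
    for (s in seq_along(src)) {
      f <- src[[s]]
      e <- trg[[s]]
      for (ew in e) {
        # E-step: distribute each target word over the source words
        # of the same sentence pair.
        post <- tp[f, ew] / sum(tp[f, ew])
        for (i in seq_along(f)) {
          counts[f[i], ew] <- counts[f[i], ew] + post[i]
        }
      }
    }
    # M-step: renormalise so that each source word's row sums to 1.
    tp <- counts / rowSums(counts)
  }
  tp
}

tp <- ibm1_sketch(src, trg, iter = 15)

# Keep, for each source word, its most probable target word,
# and drop pairs whose probability is below the threshold.
prob <- 0.8
best <- apply(tp, 1, which.max)
dict <- data.frame(source = rownames(tp),
                   target = colnames(tp)[best],
                   p = tp[cbind(seq_len(nrow(tp)), best)])
dict <- dict[dict$p >= prob, ]
print(dict)
```

In this sketch, a larger iter sharpens the probabilities, while a larger prob keeps fewer but more reliable word pairs.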
Details
The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and is about 170 MB in size. The first 10,000 sentences already yield a good dictionary, and building it takes just 1.356784 minutes on an ordinary computer. The results can be found at
http://www.um.ac.ir/~sarmad/word.a/bidictionary.pdf
Value
A list.
time |
A number (in seconds, minutes, or hours). |
number_input |
An integer. |
Value_prob |
A decimal number between 0 and 1. |
iterIBM1 |
An integer. |
dictionary |
A matrix. |
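As a rough illustration of how such a list might be inspected, the sketch below builds a hypothetical stand-in for the return value. The component names follow the list above, but every value is invented, and the assumption that the dictionary matrix has one column per language, labelled with name.sorc and name.trgt, is not confirmed by this page.

```r
# Hypothetical stand-in for the value returned by bidictionary().
# Component names follow the Value section; all contents are invented.
res <- list(
  time = 1.36,
  number_input = 2000L,
  Value_prob = 0.8,
  iterIBM1 = 15,
  # Assumed layout: one column per language, labelled with the default
  # name.sorc ('f') and name.trgt ('e').
  dictionary = matrix(c("das", "buch", "the", "book"), ncol = 2,
                      dimnames = list(NULL, c("f", "e")))
)

# Inspect the settings stored alongside the dictionary.
str(res[c("number_input", "Value_prob", "iterIBM1")])

# Look up the translation of one source word.
lookup <- res$dictionary[res$dictionary[, "f"] == "buch", "e"]
print(lookup)
```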
Note
Note that memory is a restriction: only computers with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. Of course, this depends on the corpus size.
In addition, if dtfile.path = NULL, the following question will be asked:
"Are you sure that you want to run the align.ibm1 function (It takes time)? (Yes/ No: if you want to specify word alignment path, please press 'No'.)"
Author(s)
Neda Daneshgar and Majid Sarmad.
References
Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.
http://statmt.org/europarl/v7/bg-en.tgz
See Also
Examples
# Extracting bg-en.tgz from the Europarl corpus is time consuming,
# so the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .
## Not run:
dic1 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
'http://www.um.ac.ir/~sarmad/word.a/euro.en',
n = 2000, encode.sorc = 'UTF-8',
name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH')
dic2 = bidictionary ('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
'http://www.um.ac.ir/~sarmad/word.a/euro.en',
n = 2000, encode.sorc = 'UTF-8',
name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH',
remove.pt = FALSE)
## End(Not run)