R: Building an Automatic Bilingual Dictionary

bidictionary {word.alignment}

R Documentation

Building an Automatic Bilingual Dictionary

Description

Builds a bilingual dictionary for two languages automatically from a given sentence-aligned parallel corpus.

Usage

bidictionary (..., n = -1L, iter = 15, prob = 0.8,  
              dtfile.path = NULL, name.sorc = 'f', name.trgt = 'e')

Arguments

`...`	Further arguments to be passed to `prepare.data`.
`n`	Number of sentences to be read.
`iter`	The number of iterations for IBM Model 1.
`prob`	The minimum word translation probability.
`dtfile.path`	If `NULL` (usually on the first run), a data.table is created containing the cross words of all sentences with their matched probabilities, and it is saved to a file named from `name.sorc`, `name.trgt`, `n` and `iter` as "f.e.n.iter.RData". If a file name is given, that file is read and the function continues with the remaining step, i.e. building the dictionary of the two given languages.
`name.sorc`	Source language's name in mydictionary.
`name.trgt`	Target language's name in mydictionary.
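
The file name described for `dtfile.path` can be sketched as follows. This only illustrates the stated "f.e.n.iter.RData" pattern, assuming the parts are joined with dots; the variable `dtfile` is hypothetical and the package's internal code may differ.

```r
# Values as they might be passed to bidictionary (taken from the Examples)
name.sorc <- "BULGARIAN"
name.trgt <- "ENGLISH"
n <- 2000
iter <- 15

# Assumed naming pattern: name.sorc.name.trgt.n.iter.RData
dtfile <- paste(name.sorc, name.trgt, n, iter, "RData", sep = ".")
print(dtfile)
```

According to the argument description, passing such a file name as `dtfile.path` on a later run lets the function read the saved table instead of building it again.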

Details

The results depend on the corpus. As an example, we used the English-Persian parallel corpus Mizan, which consists of more than 1,000,000 sentence pairs and has a size of about 170 MB. The first 10,000 sentences already give a good dictionary, and building it takes about 1.36 minutes on an ordinary computer. The results can be found at

http://www.um.ac.ir/~sarmad/word.a/bidictionary.pdf
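
The roles of `iter` and `prob` can be illustrated with a small, self-contained sketch of IBM Model 1 on a toy corpus. This is not the package's implementation: the function `ibm1_sketch`, the toy sentences and the final thresholding step are assumptions made only for illustration, and the sketch assumes no word is repeated within a sentence.

```r
# Toy sentence-aligned corpus (hypothetical data, for illustration only)
src <- list(c("das", "haus"), c("das", "buch"), c("ein", "buch"))
trg <- list(c("the", "house"), c("the", "book"), c("a", "book"))

# Hypothetical helper: plain IBM Model 1 trained with EM.
# Returns a matrix tp where tp[f, e] estimates P(target word e | source word f).
ibm1_sketch <- function(src, trg, iter = 15) {
  f_vocab <- unique(unlist(src))
  e_vocab <- unique(unlist(trg))
  # Uniform initialisation of the translation probabilities
  tp <- matrix(1 / length(e_vocab), length(f_vocab), length(e_vocab),
               dimnames = list(f_vocab, e_vocab))
  for (k in seq_len(iter)) {
    count <- tp * 0
    for (s in seq_along(src)) {
      f <- src[[s]]
      e <- trg[[s]]
      for (ew in e) {
        # E-step: share each target word among the source words of the pair
        w <- tp[f, ew] / sum(tp[f, ew])
        count[f, ew] <- count[f, ew] + w
      }
    }
    # M-step: renormalise so that each source word's row sums to 1
    tp <- count / rowSums(count)
  }
  tp
}

tp <- ibm1_sketch(src, trg, iter = 15)

# Assumed dictionary step: keep the best target word for each source word,
# then drop pairs whose probability is below the threshold `prob`
prob <- 0.8
best <- apply(tp, 1, which.max)
dict <- data.frame(source = rownames(tp),
                   target = colnames(tp)[best],
                   prob   = tp[cbind(seq_len(nrow(tp)), best)])
print(dict[dict$prob >= prob, ])
```

More iterations sharpen the probabilities, and a higher `prob` keeps fewer but more reliable word pairs.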

Value

A list.

`time`	A number (in seconds, minutes or hours).
`number_input`	An integer.
`Value_prob`	A decimal number between 0 and 1.
`iterIBM1`	An integer.
`dictionary`	A matrix.

Note

Note that memory is a limiting factor: only machines with a powerful CPU and a large amount of RAM can allocate the vectors this function needs. How much is required depends, of course, on the corpus size.

In addition, if dtfile.path = NULL, the following question will be asked:

"Are you sure that you want to run the align.ibm1 function (It takes time)? (Yes/ No: if you want to specify word alignment path, please press 'No'.)"

Author(s)

Neda Daneshgar and Majid Sarmad.

References

Supreme Council of Information and Communication Technology. (2013), Mizan English-Persian Parallel Corpus. Tehran, I.R. Iran. Retrieved from http://dadegan.ir/catalog/mizan.

http://statmt.org/europarl/v7/bg-en.tgz

Examples

# Since extracting bg-en.tgz from the Europarl corpus is time consuming,
# the unzipped files have been temporarily exported to
# http://www.um.ac.ir/~sarmad/... .

## Not run: 

# Dictionary built from the first 2000 sentence pairs
dic1 <- bidictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     n = 2000, encode.sorc = 'UTF-8',
                     name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH')

# Same call, with remove.pt = FALSE passed on to prepare.data
dic2 <- bidictionary('http://www.um.ac.ir/~sarmad/word.a/euro.bg',
                     'http://www.um.ac.ir/~sarmad/word.a/euro.en',
                     n = 2000, encode.sorc = 'UTF-8',
                     name.sorc = 'BULGARIAN', name.trgt = 'ENGLISH',
                     remove.pt = FALSE)

## End(Not run)

[Package word.alignment version 1.1 Index]