Statistical Machine Translation from Hindi to English using MOSES

Submitted by 点点圈 on 2019-12-07 18:38:48

Question


I need to build a Hindi-to-English translation system using Moses. I have a parallel corpus of about 10,000 Hindi sentences with their English translations. I followed the method described on the Baseline System page, but at the very first stage, when I tried to tokenise my Hindi corpus by running

~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l hi < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi

the tokeniser printed the following output:

Tokenizer Version 1.1
Language: hi
Number of threads: 1
WARNING: No known abbreviations for language 'hi', attempting fall-back to English version...

I also tried 'hin', but the language was still not recognised. Can anyone tell me the correct way to build the translation system?


Answer 1:


Moses does not support Hindi tokenization out of the box: tokenizer.perl relies on the nonbreaking_prefix.* files (see https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl#L516), and none is provided for Hindi.

The languages for which Moses provides nonbreaking prefixes are:

  • ca: Catalan
  • cs: Czech
  • de: German
  • el: Greek
  • en: English
  • es: Spanish
  • fi: Finnish
  • fr: French
  • hu: Hungarian
  • is: Icelandic
  • it: Italian
  • lv: Latvian
  • nl: Dutch
  • pl: Polish
  • pt: Portuguese
  • ro: Romanian
  • ru: Russian
  • sk: Slovak
  • sl: Slovene
  • sv: Swedish
  • ta: Tamil

from https://github.com/moses-smt/mosesdecoder/tree/master/scripts/share/nonbreaking_prefixes
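To see whether a given language code is covered by your own checkout, you can look for the matching nonbreaking_prefix.<lang> file directly. The following is a minimal sketch, not part of Moses: the function name has_nonbreaking_prefixes is made up for illustration, and the ~/mosesdecoder location is assumed from the command in the question.

```python
from pathlib import Path


def has_nonbreaking_prefixes(moses_root, lang):
    """Return True if a nonbreaking_prefix.<lang> file exists under moses_root."""
    # Directory layout taken from the repository URL quoted in the answer.
    prefix_dir = Path(moses_root).expanduser() / "scripts" / "share" / "nonbreaking_prefixes"
    return (prefix_dir / f"nonbreaking_prefix.{lang}").is_file()


if __name__ == "__main__":
    # Assumption: Moses is checked out at ~/mosesdecoder, as in the question.
    for code in ("en", "hi"):
        print(code, has_nonbreaking_prefixes("~/mosesdecoder", code))
```

If the function returns False for a code, tokenizer.perl has no abbreviation list for that language, which is the situation the warning in the question reports.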


However, all hope is not lost: you can tokenize your text with another tokenizer before training the machine translation model with Moses. Try searching for "Hindi tokenizers"; there are plenty of them around.
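As a rough illustration of what such a pre-tokenization step might look like, here is a deliberately crude sketch in Python. It is not one of the existing Hindi tokenizers and not what tokenizer.perl does: it simply puts spaces around every character that Unicode classifies as punctuation or a symbol (which includes the danda "।") and leaves everything else alone. The function name tokenize_line and the script name used below are hypothetical. Known limitations: it also splits decimal numbers and hyphenated words, knows nothing about abbreviations, and does not reproduce the special-character escaping that tokenizer.perl applies.

```python
import sys
import unicodedata


def tokenize_line(line):
    """Put spaces around every punctuation or symbol character, then normalise whitespace."""
    pieces = []
    for ch in line:
        # Unicode categories P* (punctuation, including the danda) and S* (symbols).
        if unicodedata.category(ch)[0] in ("P", "S"):
            pieces.append(f" {ch} ")
        else:
            # Letters, digits and combining vowel signs stay attached to their word.
            pieces.append(ch)
    # split() with no argument drops all runs of whitespace, including the newline.
    return " ".join("".join(pieces).split())


if __name__ == "__main__":
    # Read raw sentences on stdin, write one tokenized sentence per line on stdout.
    for raw in sys.stdin:
        print(tokenize_line(raw))
```

Saved as, say, tokenize_hi.py, it could stand in for the tokenizer step from the question: python3 tokenize_hi.py < ~/corpus/training/hi-en.hi > ~/corpus/hi-en.tok.hi. For real training data, a dedicated Hindi tokenizer is likely to give better results.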



Source: https://stackoverflow.com/questions/27669446/statistical-machine-translation-from-hindi-to-english-using-moses
