Transliterate CJK to Latin — preferably in C++ [closed]

Posted by 不想你离开。 on 2020-01-06 07:00:50

Question


I am trying to write a program that can transliterate CJK text to Latin script (Pinyin, Romaji, etc.). For example, you give it a Chinese, Japanese, or Korean document as input and get the Latin transliteration as output.

I am new to this field, so please bear with me.

Obviously, I first need to detect the language (Chinese, Japanese, or Korean) before going any further. Then, as I understand it so far, to do the transliteration I need to divide the text into words, since in these languages there is no space between words. This is called word segmentation. Finally, after finding the words, I need to transliterate them into Latin.
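The detection step could be roughed out by counting characters per Unicode block. The sketch below is only a heuristic under stated assumptions: the input is UTF-8, any kana means Japanese, Hangul syllables mean Korean, and Han characters alone mean Chinese. The names `CjkLanguage`, `decodeUtf8`, and `detectCjkLanguage` are made up for illustration and do not come from any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class CjkLanguage { Unknown, Chinese, Japanese, Korean };

// Decode UTF-8 into code points. Malformed bytes are simply skipped,
// which is good enough for a counting heuristic but not for validation.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        char32_t cp = 0;
        std::size_t extra = 0;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { ++i; continue; }          // stray continuation or invalid lead byte
        if (i + extra >= s.size()) break; // truncated sequence at end of input
        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (ok) { out.push_back(cp); i += extra + 1; }
        else    { ++i; }
    }
    return out;
}

// Heuristic guess based on which scripts occur in the text.
CjkLanguage detectCjkLanguage(const std::string& utf8) {
    std::size_t kana = 0, hangul = 0, han = 0;
    for (char32_t cp : decodeUtf8(utf8)) {
        if (cp >= 0x3040 && cp <= 0x30FF)      ++kana;   // Hiragana and Katakana
        else if (cp >= 0xAC00 && cp <= 0xD7A3) ++hangul; // Hangul Syllables
        else if (cp >= 0x4E00 && cp <= 0x9FFF) ++han;    // CJK Unified Ideographs
    }
    if (hangul > 0 && hangul >= kana) return CjkLanguage::Korean;
    if (kana > 0)                     return CjkLanguage::Japanese;
    if (han > 0)                      return CjkLanguage::Chinese;
    return CjkLanguage::Unknown;
}
```

A Japanese text written only in kanji would be reported as Chinese here, so real documents may need a statistical language detector instead.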

So here is my question:

  1. There are some libraries that do transliteration. Since I'm looking for open-source ones in C/C++, I found Adson (Chinese only) and ICU4C. The Git repo I cloned for Adson didn't compile, and I was not able to find a simple, straightforward tutorial for ICU4C. Where can I find a tutorial on ICU4C usage? Do you know of any other library that transliterates CJK to Latin? If its accuracy is high (~90%), I can live with it not being written in C++.

Answer 1:


ICU: there are examples at http://userguide.icu-project.org/transforms/general, and ICU 50 now has CJK word segmentation. The uconv sample can be run with something like uconv -f utf-8 -t utf-8 -x 'Any-Latin' to pass the text through the Any-Latin transform. That doesn't take the language into account, though.



Source: https://stackoverflow.com/questions/13455282/transliterate-cjk-to-latin-preferably-in-c
