Question
I am trying to write a program that transliterates CJK text to Latin script (e.g. Pinyin, Romaji, etc.). For example, you give it a Chinese, Japanese or Korean document as input and get the Latin transliteration as output.
I am new to this field, so please bear with me.
Obviously, I first need to detect the language (Chinese, Japanese or Korean) before going any further. Then, as far as I understand, I need to divide the text into words, since these languages do not put spaces between words. This is called word segmentation. Finally, once I have the words, I need to transliterate them into Latin.
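As a rough illustration of the detection step, here is a minimal sketch that guesses the language from the Unicode blocks present in UTF-8 input. The names CjkLanguage, decodeUtf8 and detectLanguage are hypothetical, and the rule itself is only a heuristic assumption: any Hangul means Korean, otherwise any kana means Japanese, otherwise Han characters mean Chinese. Japanese text written only in kanji would therefore be reported as Chinese.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class CjkLanguage { Unknown, Chinese, Japanese, Korean };

// Minimal UTF-8 decoder: skips bytes that cannot start a sequence and
// does not validate continuation bytes. Not suitable for untrusted input.
std::vector<char32_t> decodeUtf8(const std::string& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t len = b < 0x80          ? 1
                        : (b >> 5) == 0x06 ? 2
                        : (b >> 4) == 0x0E ? 3
                        : (b >> 3) == 0x1E ? 4
                                           : 0;
        if (len == 0 || i + len > s.size()) { ++i; continue; }
        // Keep only the payload bits of the lead byte.
        char32_t cp = (len == 1) ? b : (b & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Heuristic (an assumption, not a complete detector):
// Hangul => Korean, else kana => Japanese, else Han => Chinese.
CjkLanguage detectLanguage(const std::string& utf8) {
    bool han = false, kana = false, hangul = false;
    for (char32_t cp : decodeUtf8(utf8)) {
        if ((cp >= 0xAC00 && cp <= 0xD7A3) ||   // Hangul syllables
            (cp >= 0x1100 && cp <= 0x11FF) ||   // Hangul Jamo
            (cp >= 0x3130 && cp <= 0x318F)) {   // Hangul compatibility Jamo
            hangul = true;
        } else if (cp >= 0x3040 && cp <= 0x30FF) {  // Hiragana and Katakana
            kana = true;
        } else if ((cp >= 0x4E00 && cp <= 0x9FFF) ||  // CJK Unified Ideographs
                   (cp >= 0x3400 && cp <= 0x4DBF)) {  // Extension A
            han = true;
        }
    }
    if (hangul) return CjkLanguage::Korean;
    if (kana) return CjkLanguage::Japanese;
    if (han) return CjkLanguage::Chinese;
    return CjkLanguage::Unknown;
}
```

Calling detectLanguage on a whole document gives one coarse label for it; mixed-language documents would need the same check applied per paragraph or per run of characters.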
So here is my question:
- There are some (not many) libraries that do transliteration. Since I'm looking for open source ones in C/C++, I found Adson (Chinese only) and ICU4C. The Git repo I cloned for Adson didn't compile, and I was not able to find a simple, straightforward tutorial for ICU4C. Where can I find a tutorial on ICU4C usage? Do you know of any other library that transliterates CJK to Latin? If its accuracy is high enough (~90%), I can drop the C++ requirement.
Answer 1:
ICU: there are examples at http://userguide.icu-project.org/transforms/general, and ICU 50 now has CJK word segmentation. The uconv sample can be used with something like uconv -f utf-8 -t utf-8 -x 'Any-Latin' to run the text through the Any-Latin transform. That doesn't take language into account, though.
Source: https://stackoverflow.com/questions/13455282/transliterate-cjk-to-latin-preferably-in-c