How to convert Chinese characters to Pinyin

渐次进展 2020-12-23 15:10

To sort Chinese-language text, I want to convert Chinese characters to Pinyin, keeping the syllable of each character separate while grouping runs of successive characters together.

8 answers
  • 2020-12-23 15:46

    I had this problem and found a solution in PHP (it could probably be cleaner, but it works). I had some trouble because the file given in this topic is keyed by hexadecimal Unicode code points.

    1) Import the data from ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz (thanks pierr) into your database or any other store

    2) Load that data into an array as $pinyinArray[$hexaUnicode] = $pinyin;

    3) Use this code:

    /*
     * Decimal Unicode code point of the first UTF-8 character in $c
     * function found there: http://www.cantonese.sheik.co.uk/phorum/read.php?2,19594
     * Uses $c[0] instead of $c{0}: curly-brace string offsets were removed in PHP 8.
     */
    function uniord($c)
    {
    $ud = 0;
    if (ord($c[0])>=0 && ord($c[0])<=127) // 1 byte (ASCII): return the code point, not the character
        $ud = ord($c[0]);
    if (ord($c[0])>=192 && ord($c[0])<=223) // 2 bytes
        $ud = (ord($c[0])-192)*64 + (ord($c[1])-128);
    if (ord($c[0])>=224 && ord($c[0])<=239) // 3 bytes (most Chinese characters)
        $ud = (ord($c[0])-224)*4096 + (ord($c[1])-128)*64 + (ord($c[2])-128);
    if (ord($c[0])>=240 && ord($c[0])<=247) // 4 bytes
        $ud = (ord($c[0])-240)*262144 + (ord($c[1])-128)*4096 + (ord($c[2])-128)*64 + (ord($c[3])-128);
    if (ord($c[0])>=248 && ord($c[0])<=251) // 5 bytes (obsolete form, not valid under RFC 3629)
        $ud = (ord($c[0])-248)*16777216 + (ord($c[1])-128)*262144 + (ord($c[2])-128)*4096 + (ord($c[3])-128)*64 + (ord($c[4])-128);
    if (ord($c[0])>=252 && ord($c[0])<=253) // 6 bytes (obsolete form, not valid under RFC 3629)
        $ud = (ord($c[0])-252)*1073741824 + (ord($c[1])-128)*16777216 + (ord($c[2])-128)*262144 + (ord($c[3])-128)*4096 + (ord($c[4])-128)*64 + (ord($c[5])-128);
    if (ord($c[0])>=254 && ord($c[0])<=255) //error
        $ud = false;
    return $ud;
    }
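    /*
     * Translate $string, a single Chinese character, to its uppercase hexadecimal Unicode code point
     */
    function chineseToHexaUnicode($string) {
        return strtoupper(dechex(uniord($string)));
    }
    /*
     * Convert each character of $string to Pinyin using $pinyinArray (hexaUnicode => pinyin).
     * Characters missing from the array are passed through unchanged.
     */
    function convertChineseToPinyin($string, $pinyinArray) {
        $pinyinValue = '';
        $length = mb_strlen($string, 'UTF-8');
        for ($i = 0; $i < $length; $i++) {
            $char = mb_substr($string, $i, 1, 'UTF-8');
            $pinyinValue .= $pinyinArray[chineseToHexaUnicode($char)] ?? $char;
        }
        return $pinyinValue;
    }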
    
    $string = '龙江省五大';
    echo convertChineseToPinyin($string,$pinyinArray);
    

    echo: (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
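For readers who want to check what uniord computes, here is a rough Python sketch of the same UTF-8 arithmetic for 1- to 4-byte sequences, compared against Python's built-in ord(). The function name utf8_code_point is made up for this illustration and is not part of the PHP answer.

```python
def utf8_code_point(raw: bytes) -> int:
    """Decode the first UTF-8 character in raw by hand (1 to 4 bytes)."""
    b0 = raw[0]
    if b0 <= 127:                      # 1 byte (ASCII)
        return b0
    if 192 <= b0 <= 223:               # 2 bytes
        return (b0 - 192) * 64 + (raw[1] - 128)
    if 224 <= b0 <= 239:               # 3 bytes (most Chinese characters)
        return (b0 - 224) * 4096 + (raw[1] - 128) * 64 + (raw[2] - 128)
    if 240 <= b0 <= 247:               # 4 bytes
        return ((b0 - 240) * 262144 + (raw[1] - 128) * 4096
                + (raw[2] - 128) * 64 + (raw[3] - 128))
    raise ValueError("not a valid UTF-8 lead byte")


ch = "江"
code_point = utf8_code_point(ch.encode("utf-8"))
print(code_point == ord(ch))      # True: same value as the built-in
print(format(code_point, "X"))    # 6C5F, the hex key used for the lookup
```

In languages whose strings are already Unicode-aware, the built-in (ord in Python) does this work, so the manual decoding is only needed when working on raw bytes as the PHP code does.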

    Of course, $pinyinArray is your own array of data (hexaUnicode => pinyin), built in step 2.
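As an illustration of that lookup table, here is a minimal Python sketch of the same idea. It assumes each data line holds a hexadecimal code point followed by whitespace-separated readings, and that lines starting with # are comments; check this against the actual Uni2Pinyin file before relying on it. SAMPLE_TABLE, load_pinyin_table and to_pinyin are hypothetical names, and the five sample rows simply reuse the readings from the output above.

```python
# Hypothetical sample in an assumed "HEX<TAB>reading[<TAB>reading...]" layout.
SAMPLE_TABLE = """\
# code point, then one or more readings (assumed layout)
9F99\tlong2
6C5F\tjiang1
7701\tsheng3\txing3
4E94\twu3
5927\tda4\tdai4
"""


def load_pinyin_table(text):
    """Build {hex code point: [readings]} from the assumed line layout."""
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        code, *readings = line.split()
        table[code.upper()] = readings
    return table


def to_pinyin(string, table):
    """Replace each known character by its readings; keep unknown ones as-is."""
    parts = []
    for ch in string:
        readings = table.get(format(ord(ch), "X"))
        parts.append("(" + ",".join(readings) + ")" if readings else ch)
    return "".join(parts)


table = load_pinyin_table(SAMPLE_TABLE)
print(to_pinyin("龙江省五大", table))
# (long2)(jiang1)(sheng3,xing3)(wu3)(da4,dai4)
```

Keeping all readings per character, as the table does, leaves the choice between alternatives such as da4 and dai4 to the caller; a plain table lookup cannot pick the right one from context.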

    Hope this helps someone.

  • 2020-12-23 15:48

    In Python, you can use the pypinyin library:

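    from __future__ import unicode_literals  # only needed on Python 2
    from pypinyin import lazy_pinyin
    
    hanzi_list = ['如何', '将', '汉字', '转为', '拼音']
    # lazy_pinyin returns one toneless syllable per character; join them per word
    pinyin_list = [''.join(lazy_pinyin(word)) for word in hanzi_list]
    print(pinyin_list)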
    

    Output:

    ['ruhe', 'jiang', 'hanzi', 'zhuanwei', 'pinyin']
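Since the question is about sorting, here is a hedged sketch of why it can help to keep syllables separate instead of joining them. It uses a tiny hypothetical lookup table (TABLE and syllables are invented for this example) so that it runs without any third-party package; with pypinyin installed, lazy_pinyin could stand in for syllables, since it also returns one list entry per Chinese character.

```python
# Hypothetical five-character table, for illustration only.
TABLE = {"西": "xi", "安": "an", "先": "xian", "北": "bei", "京": "jing"}


def syllables(word):
    """One syllable per character; unknown characters are kept as-is."""
    return [TABLE.get(ch, ch) for ch in word]


words = ["先", "西安", "北京"]

# Joined strings cannot tell 西安 ("xi" + "an") from 先 ("xian").
print("".join(syllables("西安")) == "".join(syllables("先")))  # True

# Sorting on the syllable lists compares character by character instead.
print(sorted(words, key=syllables))  # ['北京', '西安', '先']
```

Python compares lists element by element, so 西安 (['xi', 'an']) sorts before 先 (['xian']) because 'xi' is less than 'xian'. Whether that matches the collation order you need is a separate decision.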
    