Recognizing text as Simplified vs. Traditional Chinese

南方客 2021-02-04 18:54

Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?

2 Answers
  • 2021-02-04 19:01

    I don't know if this will work, but I'd try using iconv to see if it will translate between the charsets correctly, comparing the results from the same conversion with //TRANSLIT and //IGNORE. If the two results match, then the charset conversion hasn't encountered any characters that fail to translate, so you should have a match.

    // iconv() returns false on failure, so rule that out before comparing.
    $test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
    $test2 = iconv("UTF-8", "big5//IGNORE", $text);
    if ($test1 !== false && $test1 === $test2) {
       echo 'traditional';
    } else {
       $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
       $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
       if ($test3 !== false && $test3 === $test4) {
          echo 'simplified';
       } else {
          echo 'Failed to match either traditional or simplified';
       }
    }
    
  • 2021-02-04 19:12

    Because big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the TRANSLIT and IGNORE results fails in many normal use cases. For example, it would fail to identify 説話 as Traditional Chinese, even though 説 is a common variant in Hong Kong of a character that is in big5.

    A simple fix is to do it in a fuzzy way:

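    // Wrapped in a function (the name is illustrative) so the returns are valid.
    function detectChineseVariant($text) {
        $test1 = iconv("UTF-8", "big5//IGNORE", $text);
        $test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
        // Count characters in each target charset, not in the default UTF-8.
        $len1 = mb_strlen((string) $test1, 'BIG-5');
        $len2 = mb_strlen((string) $test2, 'EUC-CN'); // EUC-CN encodes GB2312
        $len0 = mb_strlen($text, 'UTF-8') * 0.8; // threshold
        if ($len1 > $len2 && $len1 > $len0) {
            return 'Likely Traditional';
        }
        if ($len2 > $len1 && $len2 > $len0) {
            return 'Likely Simplified';
        }
        return 'Could not identify';
    }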
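
The fuzzy comparison above can also be sketched without iconv, whose //TRANSLIT and //IGNORE behaviour depends on the system's iconv implementation. The version below swaps in mbstring's mb_convert_encoding with the substitute character set to 'none', so that characters missing from the target charset are dropped. The helper names charsetCoverage and guessChineseVariant are hypothetical, the 0.8 threshold is carried over from the answer, and the sketch assumes the mbstring extension is loaded and the source file is saved as UTF-8.

```php
<?php
// Sketch only: charsetCoverage() and guessChineseVariant() are hypothetical
// helper names, not part of PHP or of the answers above.

// Fraction of the characters in $text that exist in $charset.
function charsetCoverage(string $text, string $charset): float
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total === 0) {
        return 0.0;
    }
    $previous = mb_substitute_character();
    mb_substitute_character('none');      // drop unconvertible characters
    $converted = mb_convert_encoding($text, $charset, 'UTF-8');
    mb_substitute_character($previous);   // restore the caller's setting
    // Count in the target charset, since $converted is no longer UTF-8.
    return mb_strlen($converted, $charset) / $total;
}

function guessChineseVariant(string $text, float $threshold = 0.8): string
{
    $big5 = charsetCoverage($text, 'BIG-5');   // Traditional-oriented charset
    $gb   = charsetCoverage($text, 'EUC-CN');  // usual byte encoding of GB2312
    if ($big5 > $gb && $big5 > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb > $big5 && $gb > $threshold) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

echo guessChineseVariant('這是繁體中文'), "\n";
echo guessChineseVariant('这是简体中文'), "\n";
```

Text made up only of characters shared by both charsets scores the same in each and falls through to 'Could not identify', which matches the behaviour of the comparison in the answer.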
    