Given a block of text that's known to be Chinese and encoded in UTF-8, is there a way to determine if it's Simplified or Traditional?
Since big5 and gb2312 omit quite a few commonly used variants that are present in Unicode, code that relies on an exact match between the results of iconv's translit and ignore modes will fail in many ordinary cases. For example, it would fail to identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, the form that big5 includes.
A simple fix is to do it in a fuzzy way: count how many characters survive each conversion instead of requiring an exact match.
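// detectChineseVariant is an illustrative name for the enclosing function.
function detectChineseVariant(string $text): string
{
    // //IGNORE drops characters the target charset cannot represent.
    $test1 = iconv("UTF-8", "big5//IGNORE", $text);
    $test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
    if ($test1 === false || $test2 === false) {
        return 'Could not identify';
    }
    // Count characters in each result's own encoding, not the default one.
    $len1 = mb_strlen($test1, 'BIG-5');
    $len2 = mb_strlen($test2, 'EUC-CN'); // mbstring's name for GB2312
    $len0 = mb_strlen($text, 'UTF-8') * 0.8; // threshold: 80% must survive
    if ($len1 > $len2 && $len1 > $len0) {
        return 'Likely Traditional';
    }
    if ($len2 > $len1 && $len2 > $len0) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}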
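The decision rule can also be separated from the conversion step, which makes the threshold easier to tune and to test without depending on a particular iconv build. The following is only a minimal sketch of that idea: the function name classifyByCounts and the $ratio parameter are hypothetical and not part of the approach as originally posted, and the counts are assumed to have been computed beforehand (for example with iconv and mb_strlen).

```php
<?php
// Hypothetical helper: applies the threshold rule to pre-computed counts.
// $total  : number of characters in the original UTF-8 text
// $big5   : characters that survived conversion to Big5
// $gb2312 : characters that survived conversion to GB2312
// $ratio  : fraction of characters that must survive (0.8 as in the answer)
function classifyByCounts(int $total, int $big5, int $gb2312, float $ratio = 0.8): string
{
    $threshold = $total * $ratio;
    if ($big5 > $gb2312 && $big5 > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb2312 > $big5 && $gb2312 > $threshold) {
        return 'Likely Simplified';
    }
    // Ties and low survival rates are deliberately left undecided.
    return 'Could not identify';
}

// Example: 10 characters, 9 representable in Big5, 6 in GB2312.
echo classifyByCounts(10, 9, 6), "\n";
```

Note that text made up only of characters shared by both character sets yields equal counts and therefore falls through to 'Could not identify', which seems the safest outcome for a heuristic like this.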