How to transliterate non-Latin scripts?

Question

I'm playing around with transliteration in PHP using iconv. In particular, I want to normalise accented characters and romanize other scripts, converting UTF-8 to plain ASCII.

While many characters work (such as Ž -> Z), others give odd results or raise errors.

For example, é (U+00E9, LATIN SMALL LETTER E WITH ACUTE) transliterates to ASCII with a single quote (U+0027) preceding the e, as if iconv were trying to represent the diacritic mark I'm trying to get rid of.

$utf_8 = "\xC3\xA9"; // <- é
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// returns "'e", not "e"

Non-Latin scripts are worse. For example, the Greek capital sigma Σ (U+03A3), which should transliterate to Latin S, is not recognised at all and raises a notice:

$utf_8 = "\xCE\xA3"; // <- Σ
$ascii = iconv( 'UTF-8', 'ASCII//TRANSLIT', $utf_8 );
// Raises notice: iconv(): Detected an illegal character in input string

I can just about cope with the first one, but how can I transliterate "Σ" to "S", and do this reliably across other scripts that have equivalent characters?

I don't mind generating my own tables if there is a good source that works for most European languages.

Note that I've tried various collation tables, which are useful for normalising accented Latin characters, but they don't work for transliterating between scripts.

Answer 1:

I've not had much luck using iconv. It always manages to throw a bunch of notices.

The best luck I've had is with using a custom transliteration table. It's far from perfect but at least you'll feel like you have some solid ground.

I've not found a good single source for transliteration tables. My unfamiliarity with anything but the Latin script isn't helping.
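As a rough illustration of the custom-table idea, here is a minimal sketch. The table contents, the constant name TRANSLIT_TABLE, the function name transliterate_with_table() and the "?" placeholder are all made up for this example; a usable table would need far more entries per script, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical, deliberately tiny mapping table (UTF-8 bytes => ASCII).
// A real table would need many more entries for each script.
const TRANSLIT_TABLE = [
    "\xC3\xA9" => 'e',  // é (U+00E9)
    "\xC5\xBD" => 'Z',  // Ž (U+017D)
    "\xC3\x9F" => 'ss', // ß (U+00DF)
    "\xCE\xA3" => 'S',  // Σ (U+03A3)
    "\xCF\x83" => 's',  // σ (U+03C3)
    "\xCF\x82" => 's',  // ς (U+03C2), final sigma
];

function transliterate_with_table(string $utf8, string $placeholder = '?'): string
{
    // strtr() with an array tries the longest keys first and does not
    // rescan text it has already replaced.
    $mapped = strtr($utf8, TRANSLIT_TABLE);

    // Anything still outside ASCII has no table entry: swap each such
    // character for the placeholder instead of raising a notice.
    $ascii = preg_replace('/[^\x00-\x7F]/u', $placeholder, $mapped);

    // preg_replace() returns null when the input is not valid UTF-8.
    return $ascii ?? '';
}
```

With this table, transliterate_with_table("\xCE\xA3") gives "S", and a character that has no entry comes back as "?" so that gaps in the table are easy to spot.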

Answer 2:

I've attempted something similar. It's mainly based on Doctrine 1 code and isn't perfect, but it seemed to work with all the test data I threw at it.
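The code from this answer isn't reproduced here. As a different approach (not the Doctrine-based one), PHP's intl extension exposes ICU transliteration through transliterator_transliterate(). The sketch below assumes intl may or may not be installed; the wrapper name transliterate_with_intl() is made up for this example.

```php
<?php
// Sketch only: relies on the optional intl extension (ICU).
// Returns null when intl is missing or the transliteration fails.
function transliterate_with_intl(string $utf8): ?string
{
    if (!function_exists('transliterator_transliterate')) {
        return null; // intl extension not loaded
    }

    // "Any-Latin" romanizes other scripts (Greek, Cyrillic, ...);
    // "Latin-ASCII" then folds accented Latin characters to ASCII.
    $ascii = transliterator_transliterate('Any-Latin; Latin-ASCII', $utf8);

    return $ascii === false ? null : $ascii;
}
```

The romanization rules come from ICU rather than from a hand-built table, so the output for a given script may not match a particular national standard. It is worth checking against your own test data.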

Source: https://stackoverflow.com/questions/17863798/how-to-transliterate-non-latin-scripts

Tags

php

localization

iconv