I use the iconv library to interface from a modern input source that uses UTF-8 to a legacy system that uses CP1252 (Windows-1252, a superset of ISO-8859-1, often loosely called Latin-1).
The interface recently failed to convert the French string "Éducation", where the "É" was encoded as hex 45 CC 81. Note that the destination encoding does have an "É" character, encoded as C9.
Why does iconv fail to convert that "É"? I checked that the iconv command-line tool that ships with Mac OS X 10.7.3 says it cannot convert, and that the Perl iconv module fails too.
This is all the more puzzling because the precomposed form of the "É" character (encoded as C3 89) converts just fine.
Is this a bug with iconv or did I miss something?
Note that I also have the same issue if I try to convert from UTF-16 (where "É" is encoded as 00 C9 composed or 00 45 03 01 decomposed).
Unfortunately, iconv indeed does not handle decomposed characters in UTF-8; the exception is the version installed on Mac OS X.
When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.
However, non-Mac versions of iconv or libiconv don't support this, and I could not find the sources used on the Mac that provide this support.
I agree with you that iconv should be able to deal with both the NFC and NFD forms of UTF-8, but until someone patches the sources we have to detect decomposed input manually and normalize it before passing it to iconv.
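Here is a minimal sketch of that manual step, written in Python rather than against iconv itself. It decodes the UTF-8 input, checks whether it is already in NFC, composes it if not, and only then encodes to CP1252. The function name `to_cp1252` is hypothetical, Python's built-in `cp1252` codec stands in for iconv here, and `unicodedata.is_normalized` assumes Python 3.8 or later.

```python
import unicodedata


def to_cp1252(utf8_bytes: bytes) -> bytes:
    """Decode UTF-8, compose to NFC if needed, then encode as CP1252.

    Hypothetical helper: Python's cp1252 codec is used in place of iconv.
    """
    text = utf8_bytes.decode("utf-8")
    # Detect decomposed (or otherwise non-NFC) input before converting.
    if not unicodedata.is_normalized("NFC", text):
        text = unicodedata.normalize("NFC", text)
    # Raises UnicodeEncodeError if a character still has no CP1252 equivalent.
    return text.encode("cp1252")


decomposed = b"\x45\xcc\x81ducation"   # "E" + U+0301 combining acute accent
precomposed = b"\xc3\x89ducation"      # U+00C9

# Both inputs end up as the same CP1252 bytes, starting with C9.
print(to_cp1252(decomposed))
print(to_cp1252(precomposed))
```

The same idea applies if you keep iconv in the pipeline: normalize to NFC first, then hand the result to iconv.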
Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.
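#!/usr/bin/perl
# Filter: read UTF-8 from stdin (or named files), write NFC-normalized UTF-8.
use strict;
use warnings;
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;    # exports NFC by default

while (<>) {
    # Decode bytes to characters, compose to NFC, then re-encode as UTF-8.
    print encode_utf8( NFC(decode_utf8 $_) );
}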
Use a normalizer (in this case, to Normalization Form C) before calling iconv.
A program that deals with character encodings (different representations of characters or, more exactly, code points, as sequences of bytes) and converts between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points (U+0045 followed by the combining acute accent U+0301) and as such distinct from the precomposed É, which is one code point (U+00C9).
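To make the distinction concrete, here is a small Python sketch (Python is used only for illustration; it is not what iconv does internally). It shows that the two forms are different code point sequences with different byte encodings, that a plain encoder rejects the decomposed form because CP1252 has no combining acute accent, and that NFC normalization turns one form into the other.

```python
import unicodedata

precomposed = "\u00c9"    # one code point: LATIN CAPITAL LETTER E WITH ACUTE
decomposed = "E\u0301"    # two code points: "E" + COMBINING ACUTE ACCENT

# Different lengths, and the strings do not compare equal.
print(len(precomposed), len(decomposed))
print(precomposed == decomposed)

# Byte sequences matching those quoted in the question.
print(precomposed.encode("utf-8").hex(), decomposed.encode("utf-8").hex())
print(precomposed.encode("utf-16-be").hex(), decomposed.encode("utf-16-be").hex())

# The precomposed form maps directly to a single CP1252 byte.
print(precomposed.encode("cp1252"))

# The decomposed form does not: U+0301 has no CP1252 equivalent.
try:
    decomposed.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc)

# After NFC normalization the two forms are identical.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```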
Source: https://stackoverflow.com/questions/9892897/why-can-iconv-convert-precomposed-form-but-not-decomposed-form-of-%c3%89-from-utf