Why can iconv convert precomposed form but not decomposed form of “É” (from UTF-8 to CP1252)

Posted by …衆ロ難τιáo~ on 2019-12-03 12:40:14
mivk

Unfortunately, iconv does not handle decomposed characters in UTF-8, except for the version installed on Mac OS X.

When dealing with Mac file names, you can use iconv with the "utf8-mac" character set option. It also takes into account a few idiosyncrasies of the Mac decomposed form.

However, non-Mac versions of iconv or libiconv don't support this character set, and I could not find the sources used on the Mac that provide this support.

I agree with you that iconv should be able to deal with both the NFC and NFD forms of UTF-8, but until someone patches the sources, we have to detect decomposed input ourselves and normalize it before passing the data to iconv.
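As a rough illustration of the "detect this manually" step, here is a minimal sketch in Python (used purely for illustration; it is not part of the original answer). The helper name `needs_nfc` is hypothetical. It simply compares a string with its NFC-normalized form.

```python
import unicodedata


def needs_nfc(text: str) -> bool:
    """Return True if text is not already in Normalization Form C.

    Hypothetical helper: compares the string with its NFC form.
    """
    return unicodedata.normalize("NFC", text) != text


decomposed = "E\u0301"   # 'E' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "\u00c9"   # U+00C9 LATIN CAPITAL LETTER E WITH ACUTE

print(needs_nfc(decomposed))   # True
print(needs_nfc(precomposed))  # False
```

Text for which the check returns True would need to be normalized before it is handed to iconv.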

Faced with this annoying problem, I used Perl's Unicode::Normalize module as suggested by Jukka.

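#!/usr/bin/perl
# Read UTF-8 text from STDIN or the files named on the command line,
# normalize it to NFC, and write it back out as UTF-8.

use strict;
use warnings;
use Encode qw/decode_utf8 encode_utf8/;
use Unicode::Normalize;

while (<>) {
    # Decode bytes to characters, compose, then re-encode as UTF-8.
    print encode_utf8( NFC( decode_utf8($_) ) );
}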

Use a normalizer (in this case, to Normalization Form C) before calling iconv.
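A minimal sketch of that order of operations, assuming Python is available. Python's built-in `cp1252` codec stands in for iconv here, and the function name `utf8_to_cp1252` is made up for the example.

```python
import unicodedata


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Decode UTF-8, compose to NFC, then encode as CP1252.

    Hypothetical helper; Python's cp1252 codec plays the role of iconv.
    """
    text = raw.decode("utf-8")
    composed = unicodedata.normalize("NFC", text)
    return composed.encode("cp1252")


# Decomposed "É": 'E' (0x45) followed by U+0301 (0xCC 0x81 in UTF-8).
decomposed_utf8 = "E\u0301".encode("utf-8")

print(utf8_to_cp1252(decomposed_utf8))  # b'\xc9'
```

Without the normalization step, the encode call would fail on the decomposed input, because CP1252 has no code for the combining accent U+0301.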

A program that deals with character encodings (different representations of characters or, more exactly, code points, as sequences of bytes) and converts between them should be expected to treat precomposed and decomposed forms as distinct. The decomposed É is two code points (U+0045 followed by U+0301) and as such is distinct from the precomposed É, which is one code point (U+00C9).
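To make the difference concrete, this short Python sketch (an illustration only, not something from the original answers) prints the code points and UTF-8 bytes of both forms.

```python
import unicodedata

precomposed = "\u00c9"                                   # single code point
decomposed = unicodedata.normalize("NFD", precomposed)   # 'E' + U+0301

for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in s)
    utf8_bytes = " ".join(f"{b:02x}" for b in s.encode("utf-8"))
    print(f"{label}: {code_points} | UTF-8 bytes: {utf8_bytes}")

# precomposed: U+00C9 | UTF-8 bytes: c3 89
# decomposed: U+0045 U+0301 | UTF-8 bytes: 45 cc 81
```

The two strings render identically but compare unequal, which is why a byte-level converter does not treat them as the same input.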
