How do locales work in Linux / POSIX and what transformations are applied?


Question


I'm working with huge files of (I hope) UTF-8 text. I can reproduce the behavior on Ubuntu 13.10 (3.11.0-14-generic) and 12.04.

While investigating a bug I've encountered strange behavior:

$ export LC_ALL=en_US.UTF-8   
$ sort part-r-00000 | uniq -d 
ɥ ɨ ɞ ɧ 251
ɨ ɡ ɞ ɭ ɯ       291
ɢ ɫ ɬ ɜ 301
ɪ ɳ     475
ʈ ʂ     565

$ export LC_ALL=C
$ sort part-r-00000 | uniq -d 
$ # no duplicates found

The duplicates also appear when running a custom C++ program that reads the file using std::stringstream: it fails due to duplicates when using the en_US.UTF-8 locale. C++ itself seems to be unaffected, at least for std::string and input/output.
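For reference, locale-sensitive string comparison in C++ goes through the std::collate facet, which applies the same collation rules that sort(1) and uniq(1) use via strcoll(). Below is a minimal sketch of such a duplicate check; the original program is not shown, so the details here are illustrative:

#include <iostream>
#include <locale>
#include <set>
#include <string>

int main()
{
    // Order strings by the locale's collation rules rather than by bytes.
    std::locale loc("en_US.UTF-8");
    const auto& coll = std::use_facet<std::collate<char>>(loc);
    auto cmp = [&coll](const std::string& a, const std::string& b) {
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size()) < 0;
    };

    std::set<std::string, decltype(cmp)> seen(cmp);
    std::string line;
    while (std::getline(std::cin, line))
        if (!seen.insert(line).second)   // insertion fails: a collation-equal line was seen
            std::cout << line << '\n';   // report lines the locale deems duplicates
}

Fed the same file, this should report "duplicates" under en_US.UTF-8 that it would not report when constructed with std::locale::classic().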

Why are duplicates found when using a UTF-8 locale and no duplicates are found with the C locale?

What transformations does the locale apply to the text that cause this behavior?

Edit: Here is a small example

$ uniq -D duplicates.small.nfc 
ɢ ɦ ɟ ɧ ɹ       224
ɬ ɨ ɜ ɪ ɟ       224
ɥ ɨ ɞ ɧ 251
ɯ ɭ ɱ ɪ 251
ɨ ɡ ɞ ɭ ɯ       291
ɬ ɨ ɢ ɦ ɟ       291
ɢ ɫ ɬ ɜ 301
ɧ ɤ ɭ ɪ 301
ɹ ɣ ɫ ɬ 301
ɪ ɳ     475
ͳ ͽ     475
ʈ ʂ     565
ˈ ϡ     565

Output of locale when the problem appears:

$ locale 
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=

Edit: After normalisation using:

cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc

I still get the same results.

Edit: The file is valid UTF-8 according to iconv:

$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0

Edit: Looks like it's something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html

It works on FreeBSD.


Answer 1:


I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: My minimal example that demonstrates the different behaviour of uniq depending on the current locale was:

$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ

Obviously, if the locale is en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and investigated both call graphs with kcachegrind.

$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &

The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll(), whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example using strcoll():

#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // ɢ (U+0262) and ɬ (U+026C), encoded as UTF-8
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    // Locale-aware comparison vs. plain byte comparison
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl; // prints 0: treated as equal!
    std::cout << std::strcmp(s1, s2) << std::endl;  // negative: the bytes differ

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl; // nonzero: plain byte comparison
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    // Same test with only the continuation bytes, which are not
    // valid UTF-8 sequences on their own
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl; // again 0
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}

Output:

ɢ
ɬ
0
-1
-10
-1

�
�
0
-1
-10
-1

So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters?
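One way to dig a level deeper (a sketch added here, not part of the original answer): strcoll() orders strings exactly as strcmp() orders the sort keys produced by strxfrm(), so two strings compare equal precisely when their keys are identical. Dumping the keys shows whether the locale assigns ɢ and ɬ the same collation weights; the suspicion is that en_US.UTF-8 on glibc gives these code points no defined weight at all:

#include <clocale>
#include <cstring>
#include <iostream>
#include <vector>

// Print the strxfrm() sort key of s in the current LC_COLLATE locale.
void print_key(const char* s)
{
    std::size_t n = std::strxfrm(nullptr, s, 0);  // length of the key
    std::vector<char> key(n + 1);
    std::strxfrm(key.data(), s, n + 1);
    for (std::size_t i = 0; i < n; ++i)
        std::cout << int(static_cast<unsigned char>(key[i])) << ' ';
    std::cout << '\n';
}

int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    print_key("\xc9\xa2");  // ɢ
    print_key("\xc9\xac");  // ɬ
}

If the two lines of output are identical, strcoll() has to return 0 for these strings in this locale.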




Answer 2:


It could be due to Unicode normalization. There are sequences of code points in Unicode which are distinct and yet are considered equivalent.

One simple example of that is combining characters. Many accented characters like "é" can be represented either as a single code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE) or as a combination of an unaccented character and a combining character, e.g. the two-character sequence <U+0065, U+0301> (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).

Those two byte sequences are obviously different, and so in the C locale, they compare as different. But in a UTF-8 locale, they're treated as identical due to Unicode normalization.

Here's a simple two-line file with this example:

$ echo -e '\xc3\xa9\ne\xcc\x81' > test.txt
$ cat test.txt
é
é
$ hexdump -C test.txt
00000000  c3 a9 0a 65 cc 81 0a                              |...e...|
00000007
$ LC_ALL=C uniq -d test.txt  # No output
$ LC_ALL=en_US.UTF-8 uniq -d test.txt
é
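Whether a given libc really collates the two forms as equivalent can be checked directly with strcoll(), along the lines of the minimal example in the first answer (a sketch; the byte sequences are the same as in the echo above):

#include <clocale>
#include <cstring>
#include <iostream>

int main()
{
    const char* precomposed = "\xc3\xa9";   // U+00E9: é as a single code point
    const char* combining   = "e\xcc\x81";  // U+0065 U+0301: e + combining acute

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    // 0 means this system's collation treats the two forms as equivalent.
    std::cout << std::strcoll(precomposed, combining) << '\n';

    std::setlocale(LC_COLLATE, "C");
    // The C locale compares bytes, so the result is nonzero here.
    std::cout << std::strcoll(precomposed, combining) << '\n';
}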

Edit by n.m.: Not all Linux systems do Unicode normalization.




Answer 3:


Purely conjecture at this point, since we can't see the actual data, but I would guess something like this is going on.

UTF-8 encodes code points 0-127 as their representative byte value; values above that take two or more bytes. There is a canonical definition of which ranges of values use a certain number of bytes, and of the format of those bytes. However, a code point could in principle be encoded in more than one way. For example, 32, the ASCII space, is canonically encoded as 0x20, but it could also be written as the overlong two-byte sequence 0xC0 0xA0. That violates a strict interpretation of the encoding, so a well-formed UTF-8 writer would never produce it. Decoders, however, are often written to be more forgiving, to cope with faulty encodings, and so the UTF-8 decoder in your particular situation might be seeing sequences that aren't strictly conforming and interpreting them in the most reasonable way it can, which would cause it to treat certain multi-byte sequences as equivalent to others. Locale collating sequences would then have a further effect.

In the C locale, 0x20 would certainly sort before 0xC0; but in UTF-8, if the decoder grabs a following 0xA0, then the single byte 0x20 would be considered equal to the two-byte sequence 0xC0 0xA0, and the two would sort together.
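To make the overlong-encoding rule concrete (a sketch, not part of the original answer): a strict decoder rejects any two-byte sequence whose lead byte is 0xC0 or 0xC1, because everything such a sequence could encode fits in a single byte:

#include <cstdint>
#include <iostream>

// True if (lead, cont) is a well-formed, non-overlong two-byte UTF-8
// sequence. Lead bytes 0xC0 and 0xC1 could only encode U+0000..U+007F,
// which must be written as one byte, so strict decoders reject them.
bool valid_two_byte_utf8(std::uint8_t lead, std::uint8_t cont)
{
    const bool lead_ok = lead >= 0xC2 && lead <= 0xDF; // excludes overlong 0xC0/0xC1
    const bool cont_ok = (cont & 0xC0) == 0x80;        // 10xxxxxx continuation byte
    return lead_ok && cont_ok;
}

int main()
{
    std::cout << valid_two_byte_utf8(0xC0, 0xA0) << '\n';  // 0: overlong space
    std::cout << valid_two_byte_utf8(0xC9, 0xA2) << '\n';  // 1: U+0262, ɢ
}

A forgiving decoder that accepts 0xC0 0xA0 anyway would decode it to U+0020, making it collate together with a plain space.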



Source: https://stackoverflow.com/questions/20226851/how-do-locales-work-in-linux-posix-and-what-transformations-are-applied
