I'm working with huge files of (I hope) UTF-8 text. While investigating a bug I encountered the following strange behavior, which I can reproduce on Ubuntu 13.10 (3.11.0-14-generic) and 12.04:
$ export LC_ALL=en_US.UTF-8
$ sort part-r-00000 | uniq -d
ɥ ɨ ɞ ɧ 251
ɨ ɡ ɞ ɭ ɯ 291
ɢ ɫ ɬ ɜ 301
ɪ ɳ 475
ʈ ʂ 565
$ export LC_ALL=C
$ sort part-r-00000 | uniq -d
$ # no duplicates found
The duplicates also appear when running a custom C++ program that reads the file using std::stringstream - it fails due to duplicates when using the en_US.UTF-8 locale. C++ seems to be unaffected, at least for std::string and input/output.
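Roughly, the C++ side does something like the following (a simplified sketch, not the actual program; the file name and the reading logic are placeholders). The point is that comparisons go through the locale's collate facet, which is what makes the result locale-dependent:

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>
#include <vector>

int main()
{
    // "C" instead of "en_US.UTF-8" gives plain byte-wise comparison
    std::locale loc("en_US.UTF-8");
    const std::collate<char>& coll = std::use_facet<std::collate<char>>(loc);

    std::ifstream in("part-r-00000");
    std::vector<std::string> lines;
    for (std::string line; std::getline(in, line); )
        lines.push_back(line);

    // Sort with the locale's collation order
    auto less = [&](const std::string& a, const std::string& b) {
        return coll.compare(a.data(), a.data() + a.size(),
                            b.data(), b.data() + b.size()) < 0;
    };
    std::sort(lines.begin(), lines.end(), less);

    // Adjacent lines that compare neither less nor greater are "duplicates"
    for (std::size_t i = 1; i < lines.size(); ++i)
        if (!less(lines[i - 1], lines[i]) && !less(lines[i], lines[i - 1]))
            std::cout << "duplicate: " << lines[i] << '\n';
}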
Why are duplicates found when using a UTF-8 locale, while none are found with the C locale? What transformations does the locale apply to the text that cause this behavior?
Edit: Here is a small example:
$ uniq -D duplicates.small.nfc
ɢ ɦ ɟ ɧ ɹ 224
ɬ ɨ ɜ ɪ ɟ 224
ɥ ɨ ɞ ɧ 251
ɯ ɭ ɱ ɪ 251
ɨ ɡ ɞ ɭ ɯ 291
ɬ ɨ ɢ ɦ ɟ 291
ɢ ɫ ɬ ɜ 301
ɧ ɤ ɭ ɪ 301
ɹ ɣ ɫ ɬ 301
ɪ ɳ 475
ͳ ͽ 475
ʈ ʂ 565
ˈ ϡ 565
Output of locale when the problem appears:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC=de_DE.UTF-8
LC_TIME=de_DE.UTF-8
LC_COLLATE="en_US.UTF-8"
LC_MONETARY=de_DE.UTF-8
LC_MESSAGES="en_US.UTF-8"
LC_PAPER=de_DE.UTF-8
LC_NAME=de_DE.UTF-8
LC_ADDRESS=de_DE.UTF-8
LC_TELEPHONE=de_DE.UTF-8
LC_MEASUREMENT=de_DE.UTF-8
LC_IDENTIFICATION=de_DE.UTF-8
LC_ALL=
Edit: After normalisation using:
cat duplicates | uconv -f utf8 -t utf8 -x nfc > duplicates.nfc
I still get the same results.
Edit: The file is valid UTF-8 according to iconv:
$ iconv -f UTF-8 duplicates -o /dev/null
$ echo $?
0
Edit: Looks like it is something similar to this: http://xahlee.info/comp/unix_uniq_unicode_bug.html and https://lists.gnu.org/archive/html/bug-coreutils/2012-07/msg00072.html
It works on FreeBSD.
I have boiled down the problem to an issue with the strcoll() function, which is not related to Unicode normalization. Recap: my minimal example that demonstrates the different behaviour of uniq depending on the current locale was:
$ echo -e "\xc9\xa2\n\xc9\xac" > test.txt
$ cat test.txt
ɢ
ɬ
$ LC_COLLATE=C uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 uniq -D test.txt
ɢ
ɬ
Obviously, with the locale set to en_US.UTF-8, uniq treats ɢ and ɬ as duplicates, which shouldn't be the case. I then ran the same commands again under valgrind and investigated both call graphs with kcachegrind:
$ LC_COLLATE=C valgrind --tool=callgrind uniq -D test.txt
$ LC_COLLATE=en_US.UTF-8 valgrind --tool=callgrind uniq -D test.txt
$ kcachegrind callgrind.out.5754 &
$ kcachegrind callgrind.out.5763 &
The only difference was that the version with LC_COLLATE=en_US.UTF-8 called strcoll() whereas the one with LC_COLLATE=C did not. So I came up with the following minimal example for strcoll():
#include <iostream>
#include <cstring>
#include <clocale>

int main()
{
    // Two different UTF-8 characters: ɢ (U+0262) and ɬ (U+026C)
    const char* s1 = "\xc9\xa2";
    const char* s2 = "\xc9\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    // Locale-aware strcoll() vs. byte-wise strcmp()
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::cout << std::endl;

    // Now just the continuation bytes (not valid UTF-8 on their own)
    s1 = "\xa2";
    s2 = "\xac";
    std::cout << s1 << std::endl;
    std::cout << s2 << std::endl;

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;

    std::setlocale(LC_COLLATE, "C");
    std::cout << std::strcoll(s1, s2) << std::endl;
    std::cout << std::strcmp(s1, s2) << std::endl;
}
Output:
ɢ
ɬ
0
-1
-10
-1
�
�
0
-1
-10
-1
So, what's wrong here? Why does strcoll() return 0 (equal) for two different characters, even though strcmp() correctly reports them as different?
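One way to look deeper is strxfrm(): it produces the transformed sort key that strcoll() effectively compares byte-wise, so if en_US.UTF-8 maps both strings to the same key, strcoll() has to return 0. A small sketch to dump the keys (the fixed buffer size is an arbitrary choice of mine):

#include <cstdio>
#include <cstring>
#include <clocale>

static void dump_key(const char* s)
{
    // strxfrm() yields the collation key: strcoll(a, b) is
    // defined to agree with strcmp() on the transformed keys.
    char key[256];
    std::size_t n = std::strxfrm(key, s, sizeof key);
    if (n >= sizeof key) {
        std::puts("key too long for buffer");
        return;
    }
    for (std::size_t i = 0; i < n; ++i)
        std::printf("%02x ", static_cast<unsigned char>(key[i]));
    std::printf("\n");
}

int main()
{
    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    dump_key("\xc9\xa2");  // ɢ
    dump_key("\xc9\xac");  // ɬ
}

If the two lines of output come out identical, the locale's collation data simply assigns both characters the same (or no) weight.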
It could be due to Unicode normalization. There are sequences of code points in Unicode which are distinct and yet are considered equivalent.
One simple example of that is combining characters. Many accented characters like "é" can be represented either as a single code point (U+00E9, LATIN SMALL LETTER E WITH ACUTE) or as a combination of an unaccented character and a combining character, e.g. the two-character sequence <U+0065, U+0301> (LATIN SMALL LETTER E, COMBINING ACUTE ACCENT).
Those two byte sequences are obviously different, and so in the C locale, they compare as different. But in a UTF-8 locale, they're treated as identical due to Unicode normalization.
Here's a simple two-line file with this example:
$ echo -e '\xc3\xa9\ne\xcc\x81' > test.txt
$ cat test.txt
é
é
$ hexdump -C test.txt
00000000 c3 a9 0a 65 cc 81 0a |...e...|
00000007
$ LC_ALL=C uniq -d test.txt # No output
$ LC_ALL=en_US.UTF-8 uniq -d test.txt
é
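The same equivalence can be checked directly with strcoll(). On a system where the uniq run above reproduces, a sketch like this should print 0 for the UTF-8 locale and a nonzero value for the C locale:

#include <cstdio>
#include <cstring>
#include <clocale>

int main()
{
    const char* precomposed = "\xc3\xa9";   // U+00E9: é as a single code point
    const char* combining   = "e\xcc\x81";  // U+0065 U+0301: e + combining acute

    std::setlocale(LC_COLLATE, "en_US.UTF-8");
    std::printf("UTF-8 locale: %d\n", std::strcoll(precomposed, combining));

    std::setlocale(LC_COLLATE, "C");
    std::printf("C locale:     %d\n", std::strcoll(precomposed, combining));
}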
Edit by n.m.: Not all Linux systems do Unicode normalization.
Purely conjecture at this point, since we can't see the actual data, but I would guess something like this is going on.
UTF-8 encodes code points 0-127 as their representative byte value; values above that take two or more bytes. There is a canonical definition of which ranges of values use a certain number of bytes, and of the format of those bytes. However, a code point could in principle be encoded in several ways. For example, 32 (the ASCII space) could be encoded as 0x20 (its canonical encoding), but it could also be encoded as the two-byte sequence 0xc0 0xa0. That violates a strict interpretation of the encoding, so a well-formed UTF-8 writing application would never encode it that way. Decoders, however, are often written to be more forgiving in order to cope with faulty encodings, so the UTF-8 decoder in your particular situation might be seeing sequences that aren't strictly conforming encodings and interpreting them in the most reasonable way it can, which would cause it to treat certain multi-byte sequences as equivalent to others. Locale collating sequences would then have a further effect.
In the C locale, 0x20 would certainly sort before 0xc0; but in a UTF-8 locale, if the decoder grabs a following 0xa0, the single byte 0x20 would be considered equal to the two-byte sequence, and the two would sort together.
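To make the overlong-encoding idea concrete, here is a hypothetical sketch of a lenient two-byte decoder next to a strict validity check (illustration only, not code from any real decoder):

#include <cstdio>

// Decode a 2-byte UTF-8 sequence 110xxxxx 10yyyyyy into a code point.
static unsigned decode2(unsigned char b0, unsigned char b1)
{
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

int main()
{
    // 0xc0 0xa0 is an overlong encoding of U+0020 (space)
    unsigned cp = decode2(0xC0, 0xA0);
    std::printf("lenient decoder: U+%04X\n", cp);  // prints U+0020

    // A strict decoder rejects it: every valid 2-byte sequence must
    // encode a code point >= 0x80, so cp < 0x80 means "overlong".
    if (cp < 0x80)
        std::printf("strict decoder: rejected as overlong\n");
}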
Source: https://stackoverflow.com/questions/20226851/how-do-locales-work-in-linux-posix-and-what-transformations-are-applied