What is the difference between strcmp() and strcoll()?

一整个雨季 2021-01-02 02:18

I tried to understand both of them, but I could not find any difference, except that for strcoll() this reference says that it

compares two null

2 answers
  •  伪装坚强ぢ
    2021-01-02 02:50

    For some reason, in every Unicode locale I tested, on several different versions of glibc, strcoll() returned zero for any two hiragana characters. This breaks sort, uniq, and anything else that depends on the ordering of strings.

    $ echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | sort | uniq

    The result is simply broken beyond repair. People from different parts of the world might have different ideas about whether 'い' should come before or after 'ろ', but nobody sane would consider them the same.

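    And no, setting your locale to the Japanese one does not matter (the variables have to be set for sort and uniq, since a prefix assignment only affects the first command of a pipeline):

    $ echo -e -n 'い\nろ\nは\nに\nほ\nへ\nと\n' | LC_ALL=ja_JP.utf8 sort | LC_ALL=ja_JP.utf8 uniq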

    There was a discussion on an official mailing list, but guess what: it was in 2002, and it was never fixed because people don't care: https://www.mail-archive.com/linux-utf8@nl.linux.org/msg02658.html

    We ran into this bug at one point, and in the end our only way out was to set the collation locale to "C" and rely on the nice properties of the UTF-8 encoding. That was a horrible experience, since one shouldn't really have to work under the "C" locale when processing all-Japanese data.
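    The "nice property" relied on here is that comparing valid UTF-8 strings byte by byte, which is what strcmp() does, orders them by Unicode code point. That order is not a linguistic one, but it is total and does not depend on LC_COLLATE. A minimal sketch of sorting that way, where the names cmp_bytes and sort_utf8_by_codepoint are made up for illustration:

```c
#include <stdlib.h>
#include <string.h>

/* qsort comparator: plain byte-wise comparison, independent of LC_COLLATE.
   For valid UTF-8 input this is the same as code point order. */
static int cmp_bytes(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* Hypothetical helper: sort an array of UTF-8 strings by code point. */
void sort_utf8_by_codepoint(const char **items, size_t n)
{
    qsort(items, n, sizeof items[0], cmp_bytes);
}
```

    With this, 'い' (U+3044), 'は' (U+306F) and 'ろ' (U+308D) come out in that order, and no two distinct strings ever compare equal.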

    So for your sanity's sake, do NOT use strcoll() directly. A safer variant might be:

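    #include <string.h>

    /* Compare by locale collation; fall back to byte order when strcoll() reports a tie. */
    int safe_strcoll(const char *a, const char *b)
    {
      int ret = strcoll(a, b);
      if (ret != 0) return ret;
      return strcmp(a, b);  /* zero only for byte-identical strings */
    }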

    just in case strcoll() decides to screw you...
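    For what it's worth, the same fallback idea drops straight into qsort(). The sketch below is only an illustration under my own naming (coll_then_bytes, cmp_qsort and sort_by_collation are hypothetical), not something taken from glibc or coreutils:

```c
#include <stdlib.h>
#include <string.h>

/* Locale collation first; byte order breaks ties, so the result is zero
   only when the two strings are byte-identical. */
int coll_then_bytes(const char *a, const char *b)
{
    int ret = strcoll(a, b);
    return ret != 0 ? ret : strcmp(a, b);
}

/* Adapter with the signature qsort() expects for an array of char pointers. */
static int cmp_qsort(const void *pa, const void *pb)
{
    return coll_then_bytes(*(const char *const *)pa,
                           *(const char *const *)pb);
}

/* Hypothetical helper: sort strings using the current LC_COLLATE setting. */
void sort_by_collation(const char **items, size_t n)
{
    qsort(items, n, sizeof items[0], cmp_qsort);
}
```

    Note that strcoll() only consults the user's locale if the program has called setlocale(LC_COLLATE, "") (or LC_ALL) beforehand, typically once at startup. Without that call a C program runs in the "C" locale, where strcoll() orders strings the same way strcmp() does.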

    EDIT: I just repeated the experiment out of curiosity, and my current system (with glibc 2.29) now works without problems. The locale doesn't matter either.
