How data.table sorts strings when setting key

温柔的废话 2021-01-11 19:03

Yesterday I spent some time tracking down a bug in my code and found that the data.table package sorts strings slightly differently from base R. Is this

2 answers
  • 2021-01-11 19:06

    Well, I am not sure what the most efficient way is but you can do the following to reproduce the data.frame result.

    dt[order(dt$cn)]
    
               cn
    1:     Ubuntu
    2:        USA
    3: Uzbekistan
    
  • 2021-01-11 19:24

    Update March 2014

    There's been some debate about this one. As of v1.9.2 we've settled for now on setkey sorting using C locale; e.g., all capital letters come before all lower case letters, regardless of user's locale. This was a change made in v1.8.8 which we had intended to reverse but have stuck with for now.
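
    A minimal sketch of that behaviour, assuming R 3.3.0 or later. Base R's order(method = "radix") also collates character vectors in the C locale, so it is used here as a locale-independent reference. The data.table part only runs if the package is installed, and the variable names are illustrative.

    ```r
    x <- c("USA", "Ubuntu", "Uzbekistan")

    # Base R: method = "radix" always collates character vectors in the C locale,
    # so uppercase letters sort before lowercase ones whatever the session locale.
    c_sorted <- x[order(x, method = "radix")]
    print(c_sorted)

    # data.table (only if installed): setkey() reorders the table by reference.
    if (requireNamespace("data.table", quietly = TRUE)) {
      dt <- data.table::data.table(cn = x)
      data.table::setkey(dt, cn)
      print(dt$cn)
    }
    ```

    Both calls should put "USA" ahead of "Ubuntu", unlike order(x) in an en_US locale.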

    Consider save()-ing a keyed table in your locale and a colleague load()-ing it in a different locale. When they join to that table it may no longer work correctly if it were locale sort order. We have to think a bit more carefully if setkey is to allow locale ordering again, probably by saving the locale name along with the "sorted" attribute, so data.table can at least compare and detect if the current locale is different to the one that ran setkey.
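
    The idea of saving the locale name next to the "sorted" attribute could look roughly like the sketch below. Both helpers and the "sorted_locale" attribute name are hypothetical and are not part of data.table; a plain character vector stands in for a keyed table.

    ```r
    # Hypothetical helpers: neither function is part of data.table.
    sort_with_locale_stamp <- function(x) {
      out <- sort(x)
      # Record the collation locale that produced this ordering.
      attr(out, "sorted_locale") <- Sys.getlocale("LC_COLLATE")
      out
    }

    # TRUE only if x was stamped under the collation locale now in use.
    sorted_in_current_locale <- function(x) {
      identical(attr(x, "sorted_locale"), Sys.getlocale("LC_COLLATE"))
    }

    keyed <- sort_with_locale_stamp(c("USA", "Ubuntu", "Uzbekistan"))
    print(sorted_in_current_locale(keyed))
    ```

    A colleague loading the object in a session with a different LC_COLLATE would get FALSE from the check and could re-sort before joining.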

    It's also for speed reasons, as sorting according to locale is much slower than sorting in the C locale. Still, we could implement locale sorting as efficiently as possible, and allowing it as an option would be ideal.

    Hence, this is now a feature request and further comments are very welcome.

    FR#4842 setkey to sort using session's locale not C locale



    Nice catch! setkey calls setkeyv, which calls fastorder to "order" the columns/entries, and fastorder in turn calls chorder.

    chorder in turn calls a C function, Ccountingcharacter.c. I suspect the difference comes down to the "locale".

    Let's see which "locale" I'm using on my Mac.

    Sys.getlocale()
    # [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
    

    Now let's see how order sorts it:

    x <- c("USA", "Ubuntu", "Uzbekistan")
    order(x)
    # [1] 2 1 3
    

    Now, let's change the "locale" to "C".

    Sys.setlocale("LC_ALL", "C")
    # [1] "C/C/C/C/C/en_US.UTF-8"
    
    order(x)
    # [1] 1 2 3
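
    Sys.setlocale("LC_ALL", "C") leaves the whole session in the C locale. A sketch of one way to run the same comparison without that side effect, assuming only LC_COLLATE needs to change; order_in_locale is a hypothetical helper, not a base R or data.table function.

    ```r
    # Hypothetical helper (not from base R or data.table): order x under a given
    # LC_COLLATE setting, then restore the previous setting on exit.
    order_in_locale <- function(x, locale = "C") {
      old <- Sys.getlocale("LC_COLLATE")
      on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
      Sys.setlocale("LC_COLLATE", locale)
      order(x)
    }

    x <- c("USA", "Ubuntu", "Uzbekistan")
    before <- Sys.getlocale("LC_COLLATE")
    print(order_in_locale(x, "C"))
    ```

    After the call, Sys.getlocale("LC_COLLATE") should report the same value as before it.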
    

    From ?order:

    The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

    From ?Comparison:

    Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z....

    So, basically, under the "C" locale order gives the same ordering as data.table's setkey. My guess is that the C function called by chorder always works in the C locale, which compares ASCII values, and in ASCII "S" comes before "b".
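
    The ASCII comparison can be checked directly in base R. This short sketch looks only at the second character of each string, since all three share the first character "U".

    ```r
    x <- c("USA", "Ubuntu", "Uzbekistan")

    # Second character of each string: "S", "b", "z".
    second <- substr(x, 2, 2)

    # Integer code point of each of those characters.
    codes <- vapply(second, utf8ToInt, integer(1))
    print(codes)

    # Ordering by code point reproduces the C-locale ordering of x.
    print(order(codes))
    ```

    Uppercase "S" has a smaller code point than lowercase "b", which is why "USA" sorts first under the C locale.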

    It's probably important to bring this to @MatthewDowle's attention (if he's not already aware of it). So, I'd suggest that you file this as a bug here (just to be sure).
