What are the R sorting rules of character vectors?

無奈伤痛 2020-11-27 22:53

R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" ...


        
2 Answers
  • 2020-11-27 23:08

    The Details section of help(sort) states:

     The sort order for character vectors will depend on the collating
     sequence of the locale in use: see ‘Comparison’.  The sort order
     for factors is the order of their levels (which is particularly
     appropriate for ordered factors).
    

    and help(Comparison) then shows:

     Comparison of strings in character vectors is lexicographic within
     the strings using the collating sequence of the locale in use: see
     ‘locales’.  The collating sequence of locales such as ‘en_US’ is
     normally different from ‘C’ (which should use ASCII) and can be
     surprising.  Beware of making _any_ assumptions about the 
     collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
     and collation is not necessarily character-by-character - in
     Danish ‘aa’ sorts as a single letter, after ‘z’.  In Welsh ‘ng’
     may or may not be a single sorting unit: if it is it follows ‘g’.
     Some platforms may not respect the locale and always sort in
     numerical order of the bytes in an 8-bit locale, or in Unicode
     point order for a UTF-8 locale (and may not sort in the same order
     for the same language in different character sets).  Collation of
     non-letters (spaces, punctuation signs, hyphens, fractions and so
     on) is even more problematic.
    

    So the sort order depends on your locale setting, specifically on its collating sequence.
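
A minimal sketch of how one might inspect this from within R, assuming a reasonably recent R version in which sort() accepts method = "radix" (documented as always collating as in the C locale). The vector x is the one from the question; the name radix_sorted is only illustrative.

```r
# Which collation rules is this session using? The value is platform-dependent.
Sys.getlocale("LC_COLLATE")

x <- c("dog", "Cat", "Dog", "cat")

# Locale-dependent: the result varies with the LC_COLLATE setting.
sort(x)

# method = "radix" collates as in the C locale regardless of the session
# locale, so uppercase ASCII letters come before lowercase ones.
radix_sorted <- sort(x, method = "radix")
radix_sorted
#[1] "Cat" "Dog" "cat" "dog"
```

This can be handy when a script needs the same order on every machine without touching the locale at all.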

  • 2020-11-27 23:17

    Sorting depends on the locale. My solution is the following.

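    I create a ~/.Renviron file containing the line LC_ALL=C (the leading # below marks the output of cat; it is not part of the file):

    cat ~/.Renviron
    #LC_ALL=C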
    

    Then sorting in R uses the C locale:

    x <- c("A", "B", "d", "F", "g", "H")
    sort(x)
    #[1] "A" "B" "F" "H" "d" "g"
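
If editing ~/.Renviron is not an option, a per-session alternative can be sketched with Sys.setlocale(). This is a swapped-in technique, not the approach described in the answer above; it changes only the LC_COLLATE category, and the names old_collate and c_sorted are illustrative.

```r
x <- c("A", "B", "d", "F", "g", "H")

# Remember the current collation so it can be restored afterwards.
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch only string collation to the C locale for this session.
Sys.setlocale("LC_COLLATE", "C")
c_sorted <- sort(x)
c_sorted
#[1] "A" "B" "F" "H" "d" "g"

# Restore the previous collation.
Sys.setlocale("LC_COLLATE", old_collate)
```

Unlike LC_ALL=C in ~/.Renviron, this leaves the other locale categories (character type, messages, and so on) unchanged.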
    