R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.
For example:
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" ...
The Details section of the help for sort() states:
The sort order for character vectors will depend on the collating sequence of the locale in use: see ‘Comparison’. The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
So the sort order depends on your locale setting.
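As a small sketch of what this means in practice (my own illustration, not part of the help pages quoted above): Sys.getlocale("LC_COLLATE") reports the collating locale of the current session, and sort() with method = "radix" orders character vectors as in the C locale whatever that setting is. The vector name animals is made up for the example.

```r
# Illustrative vector (the name is hypothetical)
animals <- c("dog", "Cat", "Dog", "cat")

# Which collating sequence is the current session using?
Sys.getlocale("LC_COLLATE")

# Locale-dependent result: the order you see here depends on the line above
sort(animals)

# method = "radix" sorts character vectors as in the C locale,
# so all upper-case letters come before all lower-case letters
c_sorted <- sort(animals, method = "radix")
c_sorted
#> [1] "Cat" "Dog" "cat" "dog"
```

This leaves the session locale untouched, which may be enough if only one or two sort() calls need ASCII order.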
Sorting depends on locale. My solution for that is the following...
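I create a ~/.Renviron file containing the line LC_ALL=C. (The leading # below only marks the output of cat, as with the R output further down; in the file itself a line starting with # would be a comment and have no effect.)
cat ~/.Renviron
#LC_ALL=C
After restarting R, sorting uses the C locale: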
x <- c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
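If editing ~/.Renviron is not an option, something similar can probably be done for a single session with Sys.setlocale(). This is a sketch of an alternative, not the .Renviron approach above. It assumes that changing only the LC_COLLATE category is enough for your case, and it restores the previous value afterwards. The names old_collate and c_order are just for illustration.

```r
x <- c("A", "B", "d", "F", "g", "H")

# Remember the current collation setting so it can be restored later
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch collation (and nothing else) to the C locale for this session
Sys.setlocale("LC_COLLATE", "C")

c_order <- sort(x)
c_order
#> [1] "A" "B" "F" "H" "d" "g"

# Put the previous collation setting back
Sys.setlocale("LC_COLLATE", old_collate)
```

Unlike the .Renviron setting, this affects only collation, so messages, number formatting and character encoding stay as they were.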