Question
Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in, so that 11 < 2:
$ echo -e '3 2\n3 11' | sort -n
3 11
3 2
In man sort, I read
-s, --stable stabilize sort by disabling last-resort comparison
which implies that without -s a last-resort comparison is done (between ties, because -s does not affect non-ties).
So the question is: how is this last-resort comparison accomplished? A reference to the source code would be welcome, if necessary to answer the question.
This answer on Unix deduces, from experimentation, that the sorting of ties is lexicographic.
Does the standard/POSIX say anything about this?
Answer 1:
Here it looks like the space after the 3 in both rows breaks the numerical sorting and lets the alphabetic sorting kick in
sort -n is not sort -n -k1,1 -k2,2. sort -n interprets the whole line (not fields!) as a number, much like atoi("3 11") gives 3. Then those numbers are sorted. Because atoi("3 11") and atoi("3 2") both yield the number 3, the two lines tie numerically, and the last-resort comparison kicks in.
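The effect of the tie can be checked directly. The following is a minimal sketch, assuming a sort that supports the -s extension (GNU and BSD sort do) and using LC_ALL=C so that the fallback is a plain byte comparison; the variable names are only for illustration.

```shell
# Two lines that both parse as the number 3 under -n.
input='3 2
3 11'

# Default: the numeric tie is broken by comparing the whole lines byte by byte.
with_fallback=$(printf '%s\n' "$input" | LC_ALL=C sort -n)

# -s: the tie is left alone, so the input order survives.
stable=$(printf '%s\n' "$input" | LC_ALL=C sort -s -n)

printf 'default:\n%s\nstable:\n%s\n' "$with_fallback" "$stable"
```

Without -s, "3 11" comes first because the byte '1' is smaller than '2'; with -s, the lines stay in the order they were read.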
how is this last-resort comparison accomplished?
The idea is that the whole lines are compared as if by strcmp or similar (i.e. strcoll). Because 1 comes before 2, strcmp("3 11", "3 2") sorts 3 11 first. No ordering options are taken into account; in particular, -n is not taken into account.
A reference to the source code would be welcome, if necessary to answer the question.
In GNU sort it's actually xmemcoll0 that takes collation into account, in compare (struct line const *a, struct line const *b) at coreutils sort.c#L2653, and there's memcmp as a fallback when LC_COLLATE is not set (that is, when the C/POSIX collation applies).
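The difference between byte order and locale collation is visible from the command line. This is a small sketch: the first command forces the C locale, where comparison is by byte value; the second assumes an en_US.UTF-8 locale is installed, which may not be the case on every system, so no particular output is claimed for it.

```shell
# C locale: comparison is by byte value, so uppercase 'B' (0x42)
# sorts before lowercase 'a' (0x61).
c_order=$(printf 'a\nB\n' | LC_ALL=C sort)
printf '%s\n' "$c_order"

# Locale-aware collation; the result depends on which locales are installed.
printf 'a\nB\n' | LC_ALL=en_US.UTF-8 sort
```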
In OpenBSD sort it's somewhere around openbsd/sort/coll.c#L528 in str_list_coll(struct bwstring *str1, struct sort_list_item **ss2), but also in list_coll_offset(), where, if all keys compare equal, top_level_str_coll is called, which just compares the whole lines.
Does the standard/POSIX say anything about this?
If "this" refers to stable sort and last-resort comparison, then sure. Here is the whole paragraph from POSIX sort (emphasis mine):
Comparisons shall be based on one or more sort keys extracted from each line of input (or, if no sort keys are specified, the entire line up to, but not including, the terminating <newline>), and shall be performed using the collating sequence of the current locale. If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.
and
Implementations are encouraged to perform the recommended further byte-by-byte comparison of lines that collate equally, even though this may affect efficiency. The impact on efficiency can be mitigated by only performing the additional comparison if the current locale's collating sequence does not have a total ordering of all characters (if the implementation provides a way to query this) or by only performing the additional comparison if the locale name associated with the LC_COLLATE category has an '@' modifier in the name (since locales without an '@' modifier should have a total ordering of all characters - see XBD LC_COLLATE). Note that if the implementation provides a stable sort option as an extension (usually -s), the additional comparison should not be performed when this option has been specified.
Answer 2:
Question: How is the last resort comparison done?
This is quickly answered in the documentation of GNU coreutils:
A pair of lines is compared as follows: sort compares each pair of fields (see --key), in the order specified on the command line, according to the associated ordering options, until a difference is found or no fields are left. If no key fields are specified, sort uses a default key of the entire line. Finally, as a last resort when all keys compare equal, sort compares entire lines as if no ordering options other than --reverse (-r) were specified. The --stable (-s) option disables this last-resort comparison so that lines in which all fields compare equal are left in their original relative order. The --unique (-u) option also disables the last-resort comparison. Unless otherwise specified, all comparisons use the character collating sequence specified by the LC_COLLATE locale.
source: Sort Invocation GNU Coreutils
This means that the final resort will sort according to the sorting order of LC_COLLATE, i.e. lexicographically (mostly).
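The two details in the quoted passage, that -r still applies to the last-resort comparison and that -u disables it, can be observed with a short experiment. This is a sketch assuming GNU sort, run under LC_ALL=C so that the fallback is a byte comparison; the variable names are illustrative.

```shell
# Both lines are numerically 3, so only the last-resort comparison orders them.
plain=$(printf '3 11\n3 2\n' | LC_ALL=C sort -n)

# -r reverses the last-resort comparison as well.
reversed=$(printf '3 11\n3 2\n' | LC_ALL=C sort -n -r)

# With -u the two lines count as duplicates, so only one of them is printed.
unique_count=$(printf '3 11\n3 2\n' | LC_ALL=C sort -n -u | wc -l)

printf 'plain:\n%s\nreversed:\n%s\nlines with -u: %s\n' \
  "$plain" "$reversed" "$unique_count"
```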
POSIX, on the other hand, adds a final, ultra-last-resort comparison which is stricter.
If this collating sequence does not have a total ordering of all characters (see XBD LC_COLLATE), any lines of input that collate equally should be further compared byte-by-byte using the collating sequence for the POSIX locale.
source: Sort POSIX standard
I am not certain whether this is implemented in GNU sort, since it is not a requirement. Nonetheless, POSIX strongly recommends it (see the last paragraph of the Rationale).
What does this mean in case of the OP?
There is a common misunderstanding about key definitions. Assume you do something like
$ sort --option -k1,3 file
It is often assumed that sort will first sort on field 1, then 2, and finally 3, using --option. This is incorrect. The key is defined as the substring spanning fields 1 through 3. When two lines collate equally on that key, sort performs the last-resort comparison (see earlier):
aa bb cc xxxxxxxx
--------- <<< rule1: according to the key
------------------ <<< rule2: lexicographical sort (last resort)
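A case where the two readings of a key actually give different results makes the distinction concrete. The sketch below uses a comma as field separator (-t,) and the C locale; the sample data is made up for illustration and relies on '+' (0x2B) sorting just before ',' (0x2C) in byte order.

```shell
# One key spanning fields 1-2: the keys are the substrings "a,z" and "a+,b".
# At the second byte '+' < ',', so "a+,b" sorts first.
span=$(printf 'a,z\na+,b\n' | LC_ALL=C sort -t, -k1,2)

# Two separate keys: field 1 is "a" versus "a+", and the shorter
# prefix "a" sorts first, so "a,z" wins before field 2 is even consulted.
separate=$(printf 'a,z\na+,b\n' | LC_ALL=C sort -t, -k1,1 -k2,2)

printf -- '-k1,2:\n%s\n-k1,1 -k2,2:\n%s\n' "$span" "$separate"
```

The separator character takes part in the comparison when a key spans several fields, which is why -k1,2 is not equivalent to -k1,1 -k2,2.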
Using GNU sort, you can see which substring is used for the sort. This is done with the --debug option. Here you see the difference between 3 simple cases:
# Sort lexicographically with full line
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort --debug
sort: using ‘en_GB.UTF-8’ sorting rules
ab c d
______
efg h i
_______
# -------------------------------------------------------------------
# Sort lexicographically with the substring formed by field 1 and 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
ab c d
____
______
efg h i
_____
_______
# -------------------------------------------------------------------
# Sort lexicographically with field 1 followed by field 2
# -------------------------------------------------------------------
$ echo -e "ab c d\nefg h i" | sort -k1,1 -k2,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
sort: leading blanks are significant in key 1; consider also specifying 'b'
sort: leading blanks are significant in key 2; consider also specifying 'b'
ab c d
__
  __
______
efg h i
___
   __
_______
When you do a numeric sort (using -n or -g), sort will attempt to extract a number from the key (1234abc leads to 1234) and use that number for the sorting.
# Sort numerically with full line
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n --debug
sort: using ‘en_GB.UTF-8’ sorting rules
3a 11a
_ # numeric on full line
______ # lexicographically on full line (last resort)
3b 2b
_ # numeric on full line
_____ # lexicographically on full line (last resort)
# -------------------------------------------------------------------
# Sort numerically with field 1 then field 2
# -------------------------------------------------------------------
$ echo -e "3a 11a\n3b 2b" | sort -n -k1,1 -k2,2 --debug
sort: using ‘en_GB.UTF-8’ sorting rules
3b 2b
_ # numeric on field 1
   _ # numeric on field 2
_____ # lexicographically on full line (last resort)
3a 11a
_ # numeric on field 1
   __ # numeric on field 2
______ # lexicographically on full line (last resort)
As you can see in these two cases, even though the first field could be ordered lexicographically (3a < 3b), that ordering is ignored because only the number is picked from the key.
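The number-extraction behaviour can also be seen without --debug. This is a minimal sketch with made-up input, run in the C locale so that the non-numeric comparison is by byte value.

```shell
# -n reads only the leading number of each line: 9 < 10, so "9xyz" comes first.
numeric=$(printf '10abc\n9xyz\n' | LC_ALL=C sort -n)

# Without -n the lines are compared as text: '1' < '9', so "10abc" comes first.
lexical=$(printf '10abc\n9xyz\n' | LC_ALL=C sort)

printf 'numeric:\n%s\nlexical:\n%s\n' "$numeric" "$lexical"
```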
Source: https://stackoverflow.com/questions/65302655/does-sort-n-handle-ties-predictably-when-the-stable-option-is-not-provided-i