Strange collation with PostgreSQL

Question

I noticed a strange collation issue with PostgreSQL 9.5: it was giving different output from a Python script. As I understand it, strings are normally compared one character at a time, from left to right, when sorting:

select 'ab' < 'ac';
 t

select 'abX' < 'ac';
 t

So appending 'X' to the left-hand string makes no difference, because the comparison is already decided at the second character.

So I was surprised that this does not hold when comparing a space with a dash:

select 'a ' < 'a-';
 t

select 'a X' < 'a-';
 f

Is this a bug, or is there a way around it? I want the last statement above to return true as well.

[Edit after accepting @laurenz-albe's answer]

show lc_collate;
 en_US.UTF-8

If I sort using the locale's collation in Python I get similar results, so it's not a bug in PostgreSQL but, in my opinion, a bug in the Unicode collation spec:

>>> import locale; locale.setlocale(locale.LC_ALL, "")
'en_US.UTF-8'
>>> l = ['ac', 'ab']; sorted(l) == sorted(l, cmp=locale.strcoll)
True
>>> l = ['ac', 'abX']; sorted(l) == sorted(l, cmp=locale.strcoll)
True
>>> l = ['a-', 'a X']; sorted(l) == sorted(l, cmp=locale.strcoll)
False
>>> sorted(l),  sorted(l, cmp=locale.strcoll)
(['a X', 'a-'], ['a-', 'a X'])
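The session above relies on the `cmp=` argument of `sorted`, which exists only in Python 2. A minimal Python 3 sketch of the same check follows; the helper name `sorted_with_locale` is hypothetical, and the example call uses the "C" locale only because it is available everywhere. Passing 'en_US.UTF-8' instead assumes that locale is installed on the machine.

```python
import functools
import locale


def sorted_with_locale(items, locale_name):
    """Sort strings with the collation rules of the given OS locale.

    Hypothetical helper for illustration; it temporarily switches
    LC_COLLATE and restores the previous setting afterwards.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # cmp_to_key adapts the two-argument strcoll to Python 3's key= API.
        return sorted(items, key=functools.cmp_to_key(locale.strcoll))
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# The "C" locale compares code units, so the space sorts before the dash.
c_order = sorted_with_locale(['a-', 'a X'], 'C')
print(c_order)
```

Calling `sorted_with_locale(['a-', 'a X'], 'en_US.UTF-8')` on a system with that locale should reproduce the locale-aware ordering from the session above, although the exact result depends on the C library.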

Answer 1:

Characters are compared using a collation. You can use the SQL command

SHOW lc_collate;

to see what your default collation is.

PostgreSQL 9.5 takes its collations from the operating system's C library, so the result of a comparison can differ from one operating system to another.
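One way to make sense of the two results in the question: locale-aware collations commonly compare strings in several passes, and in locales such as en_US.UTF-8 spaces and punctuation typically carry no weight in the first pass. The sketch below is a toy model under that assumption, not the actual C library or Unicode algorithm. The names `IGNORED_ON_FIRST_PASS`, `toy_collation_key`, and `toy_less_than` are made up for illustration.

```python
# Assumption: these characters are skipped in the first comparison pass.
IGNORED_ON_FIRST_PASS = {' ', '-'}


def toy_collation_key(s):
    """Build a two-level sort key for the toy model.

    Level 1: the string with ignored characters removed.
    Level 2: the full string, used only to break ties.
    """
    primary = ''.join(ch for ch in s if ch not in IGNORED_ON_FIRST_PASS)
    return (primary, s)


def toy_less_than(a, b):
    """Return True if a sorts before b in the toy model."""
    return toy_collation_key(a) < toy_collation_key(b)


# 'a ' and 'a-' tie on level 1 ('a' vs 'a'); the tie-break puts space first.
print(toy_less_than('a ', 'a-'))
# 'a X' becomes 'aX' on level 1, which sorts after 'a', so this is False.
print(toy_less_than('a X', 'a-'))
```

In this model the 'X' is compared before the space and the dash are ever looked at, which matches the behaviour observed in the question.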

To force a byte-wise comparison, use the "C" collation:

test=> SELECT 'a X' COLLATE "C" < 'a-';
 ?column?
----------
 t
(1 row)

or the byte-wise comparison operators:

test=> SELECT 'a X' ~<~ 'a-';
 ?column?
----------
 t
(1 row)
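If the Python script from the question should agree with the byte-wise ordering shown here, the same rule can be applied on the client side. This is a minimal sketch that assumes the strings are stored as UTF-8 in the database; `bytewise_sorted` is a hypothetical helper name.

```python
def bytewise_sorted(items):
    """Sort strings by their UTF-8 bytes, ignoring the locale.

    Assumption: the database encoding is UTF-8, so this key matches
    a byte-by-byte comparison of the stored values.
    """
    return sorted(items, key=lambda s: s.encode('utf-8'))


# Space is byte 0x20 and dash is 0x2D, so 'a X' sorts before 'a-'.
result = bytewise_sorted(['a-', 'a X', 'a '])
print(result)  # ['a ', 'a X', 'a-']
```

Python's default string comparison orders by code point, which gives the same order as UTF-8 bytes, so a plain `sorted(items)` behaves the same way; the explicit key only makes the intent visible.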

Source: https://stackoverflow.com/questions/37836582/strange-collation-with-postgresql

Tags

postgresql

sorting

collation

utf