Sort values using a specific collation in Ruby/Rails

忘了有多久 2021-01-02 01:14

Is it possible to sort an array of values using a specific collation in Ruby? I need to sort according to the da_DK collation.

Given the array %w(Aarhu

2 Answers
  • 2021-01-02 01:38

    I found the ffi-locale gem on GitHub, and as far as I can see it solves my problem.

    It allows the following code:

    FFILocale::setlocale FFILocale::LC_COLLATE, 'da_DK.UTF-8'
    %w(Aarhus Aalborg Assens).sort { |a,b| FFILocale::strcoll(a, b) }
    

    Which returns the correct result:

    => ["Assens", "Aalborg", "Aarhus"]
    

    I haven't investigated performance yet, but it calls out to native code, so it ought to be faster than Ruby character replacement code...

    Update
    It is not perfect :( It does not work properly on Snow Leopard. It seems that the strcoll function is broken on OS X and has been for some time. That is annoying for me, but the main deployment platform is Linux, where it works, so it is currently my preferred solution.

  • 2021-01-02 01:44

    According to Wikipedia:

    In the Danish and Norwegian alphabets, the same extra vowels as in Swedish (see below) are also present but in a different order and with different glyphs (..., X, Y, Z, Æ, Ø, Å). Also, "Aa" collates as an equivalent to "Å". The Danish alphabet has traditionally seen "W" as a variant of "V", but today "W" is considered a separate letter.

    This would throw off sorting.

    Do this to fix the problem:

    names = %w(Aarhus Aalborg Assens)
    names.sort_by { |w| w.gsub('Aa', 'Å') } # => ["Assens", "Aalborg", "Aarhus"]
    

    and something similar for the other letters that have compound character combinations to convert to the single character.
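
    One caveat: the gsub trick works for this example only because the UTF-8 bytes of 'Å' compare greater than any ASCII letter. Ruby compares strings byte by byte, so Å (U+00C5) would sort before Æ (U+00C6) and Ø (U+00D8), which is not the Danish order. A rough sketch of a sort key that ranks each letter explicitly might look like the following. The names DANISH_ALPHABET, DANISH_RANK and danish_sort_key are made up for illustration, the blanket 'aa' => 'å' replacement is a simplification (not every "aa" in Danish stands for "å"), and it assumes Ruby 2.4 or later so that downcase handles non-ASCII letters.

```ruby
# Danish alphabet order: a..z followed by æ, ø, å (simplified, case-insensitive).
DANISH_ALPHABET = [*'a'..'z', 'æ', 'ø', 'å'].freeze
DANISH_RANK = DANISH_ALPHABET.each_with_index.to_h.freeze

# Hypothetical helper: turns a word into an array of integer ranks.
# Characters outside the alphabet get -1, so they sort before all letters.
def danish_sort_key(word)
  word.downcase.gsub('aa', 'å').each_char.map { |ch| DANISH_RANK.fetch(ch, -1) }
end

names = %w(Aarhus Aalborg Assens Ærø Odense Øster)
sorted = names.sort_by { |w| danish_sort_key(w) }
p sorted # => ["Assens", "Odense", "Ærø", "Øster", "Aalborg", "Aarhus"]
```

    Arrays of integers compare element by element, so sort_by can use them directly as keys. This is still only an approximation of the real da_DK collation rules.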

    The reason this works is that sort_by performs a Schwartzian transform: it sorts by the value returned from the block, which in this case is the name with 'Aa' replaced by 'Å'. The replacement is temporary and is discarded once the array is sorted.
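
    To make that concrete, here is roughly what sort_by does, written out by hand as decorate, sort, undecorate. It is only a sketch of the idea, not Ruby's actual implementation, and the variable names are made up for illustration.

```ruby
names = %w(Aarhus Aalborg Assens)

# Decorate: pair each name with its computed sort key.
decorated = names.map { |w| [w.gsub('Aa', 'Å'), w] }

# Sort by the key only, then undecorate by throwing the key away.
sorted = decorated
         .sort { |a, b| a.first <=> b.first }
         .map { |_key, original| original }

p sorted # => ["Assens", "Aalborg", "Aarhus"]
p names  # => ["Aarhus", "Aalborg", "Assens"] (the originals are untouched)
```

    The key is computed once per element rather than once per comparison, which is where sort_by gets its advantage when the key is expensive.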

    sort_by is very powerful, but it does have some overhead. For a simple sort, use sort because it's faster. When you're comparing two simple values at the top level of an object, it's a wash whether you use sort or sort_by. If you have to do more complex calculations or dig around in an object, sort_by can prove to be faster. There isn't a hard-and-fast rule for which is better, so I strongly recommend benchmarking if you have to sort large arrays or deal with objects, because the difference can be large, and sometimes sort is the better choice.
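
    A quick way to check this on your own data is the Benchmark module from the standard library. The sketch below uses made-up random words and downcase as a stand-in for whatever key computation you actually need, so treat the timings as illustrative only.

```ruby
require 'benchmark'

# Made-up test data: 20,000 random lowercase 8-letter words.
LETTERS = ('a'..'z').to_a
words = Array.new(20_000) { Array.new(8) { LETTERS.sample }.join }

by_sort = by_sort_by = nil

# sort: the block runs (and recomputes the key) for every comparison.
t_sort = Benchmark.realtime do
  by_sort = words.sort { |a, b| a.downcase <=> b.downcase }
end

# sort_by: the key is computed once per element.
t_sort_by = Benchmark.realtime do
  by_sort_by = words.sort_by { |w| w.downcase }
end

puts format('sort: %.4fs  sort_by: %.4fs', t_sort, t_sort_by)
```

    Run it a few times and with realistic input sizes, since the numbers vary from machine to machine and between Ruby versions.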

    EDIT:

    Ruby, by itself, isn't going to do what you want, because it has no knowledge of the sort order of every character set out there. There's a discussion regarding incorporating IBM's ICU that explains why that is. If you want ICU's abilities, you could look into ICU4R. I haven't played with it, but it sounds like your only real solution in Ruby.

    You might be able to do something with a database like Postgres. They support various collating options but usually force you to declare the collation when you create the database... or maybe it's when the table is created... it's been a while since I created a new table. Anyway, that'd be an option, though it would be a pain.
