Why don’t word characters (\w) match right under the use locale pragma?

清酒与你 2021-01-16 14:32

When I use locale, some characters from my locale (et_EE.UTF-8) are not matched by \w, and I don't see any reason for it.

In addition to

1 Answer
  • 2021-01-16 14:54

    Please do not use the broken use locale pragma.
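
    For the \w part of the question, Perl's own Unicode rules may already be enough, with no locale involved. The following is only a minimal sketch, assuming the text is held as decoded character strings (here via use utf8 on literals; the letter list is illustrative) and that the /u regex flag from Perl 5.14 or later is available:

```perl
#!/usr/bin/env perl
use v5.14;                 # 5.14+ is needed for the /u regex flag
use utf8;                  # this source file contains UTF-8 literals
use warnings;
use open qw(:std :utf8);   # encode what we print to STDOUT as UTF-8

# No "use locale" anywhere: /u asks for Unicode rules for \w explicitly.
# Assumption: real input would be decoded first (e.g. via an :encoding layer).
my @letters = split //, "õäöüšž";
my @matched = grep { /^\w$/u } @letters;

say "matched ", scalar(@matched), " of ", scalar(@letters), " characters";
```

    Every character in the list should count as a word character here, because \w under Unicode rules covers letters and combining marks regardless of the POSIX locale in effect.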

    Please, please, please use Unicode::Collate::Locale for locale collation. It uses the CLDR rules, and is completely portable and doesn’t rely on dodgy broken POSIX locales, which simply do not work well.

    If you sort by code point, you get nonsense, but if you sort using a Unicode::Collate::Locale object constructed with the Estonian locale, you get something reasonable:

    Codepoint sort:  äðõöüŋšžц
    Estonian  sort:  ðŋšžõäöüц
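
    The same comparison works on whole words. This is a small sketch rather than part of the original demo: the word list is made up for illustration, one word per letter shown above, and the collator is built the same way as in the demo program further down.

```perl
#!/usr/bin/env perl
use v5.14;
use utf8;                  # the word list below is written in UTF-8
use warnings;
use open qw(:std :utf8);

use Unicode::Collate::Locale;

# Estonian tailoring from CLDR, no POSIX locale involved
my $collator = Unicode::Collate::Locale->new(locale => "et");

# Illustrative words only; each starts with a different letter
my @words = qw(äke öö šokk õun üks žanr);

my @by_codepoint = sort @words;              # plain string comparison
my @by_estonian  = $collator->sort(@words);  # Estonian alphabet order

say "Codepoint: @by_codepoint";
say "Estonian:  @by_estonian";
```

    With this list the Estonian order should follow the single-letter order shown above (š ž õ ä ö ü), while the plain sort follows code point values.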
    

    Also, when you do this raw codepoint sort, you are terribly affected by normalization matters. Consider:

    NFC/NFD sort by codepoint is DIFFERENT
    NFC Codepoint sort:  äðõöüŋšžц
    NFD Codepoint sort:  äõöšüžðŋц
    
    NFC/NFD sort in estonian  is SAME
    NFC Estonian  sort:  ðŋšžõäöüц
    NFD Estonian  sort:  ðŋšžõäöüц
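
    The normalization point can also be seen on a single string. A minimal sketch, separate from the demo program and using an illustrative word: the NFC and NFD forms differ under eq, but the collator's own eq method treats them as equal because Unicode::Collate normalizes its input by default.

```perl
#!/usr/bin/env perl
use v5.14;
use utf8;
use warnings;
use open qw(:std :utf8);

use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => "et");

my $nfc = NFC("õun");   # õ as a single code point, U+00F5
my $nfd = NFD("õun");   # o followed by U+0303 COMBINING TILDE

my $raw_equal      = ($nfc eq $nfd)            ? "yes" : "no";
my $collated_equal = $collator->eq($nfc, $nfd) ? "yes" : "no";

say "lengths:             ", length($nfc), " vs ", length($nfd);
say "equal by code point: $raw_equal";
say "equal by collator:   $collated_equal";
```

    This is why the code point sort changes between NFC and NFD input while the Estonian sort does not.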
    

    And here is the demo program that produced all that.

    #!/usr/bin/env perl
    #
    # et-demo - show how to handle Estonian collation correctly
    #
    # Tom Christiansen <tchrist@perl.com>
    # Fri Feb 22 19:27:51 MST 2013
    
    use v5.14;
    use utf8;
    use strict;
    use warnings;
    use warnings FATAL => "utf8";
    use open qw(:std :utf8);
    
    use Unicode::Normalize;
    use Unicode::Collate::Locale;
    
    main();
    exit();
    
    sub graphemes(_) {
        my($str) = @_;
        my @graphs = $str =~ /\X/g;
        return @graphs;
    }
    
    sub same_diff($$) {
        my($s1, $s2) = @_;
        no locale;
    
        if (NFC($s1) eq NFC($s2)) {
            return "SAME";
        } else {
            return "DIFFERENT";
        }
    }
    
    sub stringy {
        return join("" => @_);
    }
    
    sub cp_sort {
        no locale;
        return sort @_;
    }
    
    sub et_sort {
        state $collator = # we want Estonian here:
            Unicode::Collate::Locale->new(locale => "et");
        return $collator->sort(@_);
    }
    
    sub main {
        my $orig = "õäöüšž ðŋц";
    
        say "    Codepoint sort: ", cp_sort(graphemes($orig));
        say "    Estonian  sort: ", et_sort(graphemes($orig));
    
        my $nfc = NFC($orig);
        my $nfc_cp_sort = stringy cp_sort(graphemes($nfc));
        my $nfc_et_sort = stringy et_sort(graphemes($nfc));
    
        my $nfd = NFD($orig);
        my $nfd_cp_sort = stringy cp_sort(graphemes($nfd));
        my $nfd_et_sort = stringy et_sort(graphemes($nfd));
    
        say "NFC/NFD sort by codepoint is ",
            same_diff($nfc_cp_sort, $nfd_cp_sort);
    
        say "NFC Codepoint sort: ", $nfc_cp_sort;
        say "NFD Codepoint sort: ", $nfd_cp_sort;
    
        say "NFC/NFD sort in estonian  is ",
            same_diff($nfc_et_sort, $nfd_et_sort);
    
        say "NFC Estonian  sort: ", $nfc_et_sort;
        say "NFD Estonian  sort: ", $nfd_et_sort;
    
    }
    

    That really is how you should be handling locale collation. See also this answer for numerous examples.
