Why don’t word characters (\w) match right under the use locale pragma?

前端未结


When I use locale, some characters from my locale (et_EE.UTF-8) are not matched by \w, and I don't see any reason for that.

In addition to


1 answer

广开言路

2021-01-16 14:54

Please do not use the broken use locale pragma.

Please, please, please use Unicode::Collate::Locale for locale collation. It uses the CLDR rules, and is completely portable and doesn’t rely on dodgy broken POSIX locales, which simply do not work well.

If you sort by code point, you get nonsense, but if you sort using a Unicode::Collate::Locale object constructed with the Estonian locale, you get something reasonable:

Codepoint sort:  äðõöüŋšžц
Estonian  sort:  ðŋšžõäöüц
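
The same contrast shows up with whole words, not just single letters. Here is a minimal sketch, assuming a handful of sample Estonian words picked purely for illustration (they are not from the original question); it uses only the builtin sort and the new and sort methods of Unicode::Collate::Locale:

```perl
#!/usr/bin/env perl
use v5.14;
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);

use Unicode::Collate::Locale;

# Hypothetical sample words, chosen only so that each one starts
# with a different letter from the single-letter demo.
my @words = qw(õun äri öö üks šokk žanr);

# The builtin sort compares strings code point by code point.
my @by_codepoint = sort @words;

# A collator built from the CLDR tailoring for Estonian ("et").
my $collator    = Unicode::Collate::Locale->new(locale => "et");
my @by_estonian = $collator->sort(@words);

say "Codepoint sort: @by_codepoint";
say "Estonian  sort: @by_estonian";
```

If the collator behaves as in the single-letter output above, the words starting with š and ž should come out ahead of those starting with õ, ä, ö and ü.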

Also, when you do this raw codepoint sort, you are terribly affected by normalization matters. Consider:

NFC/NFD sort by codepoint is DIFFERENT
NFC Codepoint sort:  äðõöüŋšžц
NFD Codepoint sort:  äõöšüžðŋц

NFC/NFD sort in estonian  is SAME
NFC Estonian  sort:  ðŋšžõäöüц
NFD Estonian  sort:  ðŋšžõäöüц
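
To see why the raw codepoint sort is so sensitive to normalization, it helps to look at a single letter. The following is a small sketch, assuming "õ" as the sample letter: in NFD it becomes a plain "o" followed by a combining tilde, so a codepoint comparison sees it as starting with an ASCII letter, while the collator treats both forms as the same string. It uses NFC and NFD from Unicode::Normalize and the eq method of the collator.

```perl
#!/usr/bin/env perl
use v5.14;
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);

use Unicode::Normalize qw(NFC NFD);
use Unicode::Collate::Locale;

my $nfc = NFC("õ");   # precomposed: U+00F5
my $nfd = NFD("õ");   # decomposed: "o" (U+006F) + combining tilde (U+0303)

say "NFC length:  ", length($nfc);
say "NFD length:  ", length($nfd);

# The builtin eq compares code points, so the two forms differ.
say "builtin eq:  ", ($nfc eq $nfd ? "same" : "different");

# The collator normalizes its input before comparing.
my $collator = Unicode::Collate::Locale->new(locale => "et");
say "collator eq: ", ($collator->eq($nfc, $nfd) ? "same" : "different");
```

This is the reason the NFD codepoint sort above files ä, õ, ö next to their base letters instead of where the NFC sort put them.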

And here is the demo program that produced all that.

#!/usr/bin/env perl
#
# et-demo - show how to handle Estonian collation correctly
#
# Tom Christiansen <tchrist@perl.com>
# Fri Feb 22 19:27:51 MST 2013

use v5.14;
use utf8;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);

use Unicode::Normalize;
use Unicode::Collate::Locale;

main();
exit();

sub graphemes(_) {
    my($str) = @_;
    my @graphs = $str =~ /\X/g;
    return @graphs;
}

sub same_diff($$) {
    my($s1, $s2) = @_;
    no locale;

    if (NFC($s1) eq NFC($s2)) {
        return "SAME";
    } else {
        return "DIFFERENT";
    }
}

sub stringy {
    return join("" => @_);
}

sub cp_sort {
    no locale;
    return sort @_;
}

sub et_sort {
    state $collator = # we want Estonian here:
        Unicode::Collate::Locale->new(locale => "et");
    return $collator->sort(@_);
}

sub main {
    my $orig = "õäöüšž ðŋц";

    say "    Codepoint sort: ", cp_sort(graphemes($orig));
    say "    Estonian  sort: ", et_sort(graphemes($orig));

    my $nfc = NFC($orig);
    my $nfc_cp_sort = stringy cp_sort(graphemes($nfc));
    my $nfc_et_sort = stringy et_sort(graphemes($nfc));

    my $nfd = NFD($orig);
    my $nfd_cp_sort = stringy cp_sort(graphemes($nfd));
    my $nfd_et_sort = stringy et_sort(graphemes($nfd));

    say "NFC/NFD sort by codepoint is ",
        same_diff($nfc_cp_sort, $nfd_cp_sort);

    say "NFC Codepoint sort: ", $nfc_cp_sort;
    say "NFD Codepoint sort: ", $nfd_cp_sort;

    say "NFC/NFD sort in estonian  is ",
        same_diff($nfc_et_sort, $nfd_et_sort);

    say "NFC Estonian  sort: ", $nfc_et_sort;
    say "NFD Estonian  sort: ", $nfd_et_sort;

}

That really is how you should be handling locale collation. See also this answer for numerous examples.
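
As for the \w part of the question, here is a rough sketch of matching without use locale at all. It assumes the text is already decoded into character strings (here via use utf8 for the literal) and that the unicode_strings feature is on, which use v5.14 enables; the sample string is just the Estonian letters from the demo above, not anything from the original question.

```perl
#!/usr/bin/env perl
use v5.14;          # also turns on the unicode_strings feature
use utf8;           # string literals in this file are UTF-8
use strict;
use warnings;
use open qw(:std :utf8);

# Sample letters taken from the demo string; no locale pragma in effect.
my $letters = "õäöüšž";

# Collect any character that \w fails to match.
my @non_word = grep { !/\w/ } split //, $letters;

say @non_word
    ? "not matched by \\w: @non_word"
    : "every letter matched \\w";
```

Under these assumptions \w goes by Unicode character properties, so it does not depend on which POSIX locales happen to be installed on the machine.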
