When I use locale, some characters from my locale (et_EE.UTF-8) are not matched with \w and I don't see any reason there.
In addition to
Please do not use the broken use locale pragma.
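As for the \w part of the question, here is a minimal sketch of matching those letters with no locale in play at all. It assumes the source file is saved as UTF-8 (hence use utf8) and relies on the Unicode string semantics that use v5.14 turns on. The helper name non_word_chars is made up for this illustration.

```perl
use v5.14;       # also enables the unicode_strings feature
use utf8;        # string literals in this file are UTF-8
use strict;
use warnings;
use open qw(:std :utf8);

# Hypothetical helper: list the characters of $str that \w fails to match.
# In scalar context it returns how many such characters there are.
sub non_word_chars {
    my ($str) = @_;
    return grep { !/\w/ } split //, $str;
}

my $letters = "õäöüšž";
my @missed  = non_word_chars($letters);

say @missed
    ? "not matched by \\w: @missed"
    : "every letter matched \\w";
```

Note that there is no use locale anywhere: the letters are ordinary Unicode word characters once the string holds characters rather than bytes.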
Please, please, please use Unicode::Collate::Locale for locale collation. It uses the CLDR rules, and is completely portable and doesn’t rely on dodgy broken POSIX locales, which simply do not work well.
If you sort by code point, you get nonsense, but if you sort using a Unicode::Collate::Locale object constructed with the Estonian locale, you get something reasonable:
Codepoint sort: äðõöüŋšžц
Estonian sort: ðŋšžõäöüц
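The same thing holds for whole words, not just single letters. Here is a small sketch under the same assumptions (UTF-8 source, the "et" locale of Unicode::Collate::Locale). The word list and the helper names cp_sorted and et_sorted are only for illustration.

```perl
use v5.14;
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use Unicode::Collate::Locale;

# One collator object, tailored for Estonian through the CLDR rules.
my $collator = Unicode::Collate::Locale->new(locale => "et");

# Plain sort: compares the strings code point by code point.
sub cp_sorted { return sort @_ }

# Collator sort: compares the strings using the Estonian tailoring.
sub et_sorted { return $collator->sort(@_) }

# Sample words picked for this illustration only.
my @words = qw(äri õun öö ülikool šokolaad žanr);

say "Codepoint: @{[ cp_sorted(@words) ]}";
say "Estonian:  @{[ et_sorted(@words) ]}";
```

Given the letter order shown above (š and ž ahead of õ, ä, ö, ü), the Estonian line should come out as šokolaad žanr õun äri öö ülikool, while the code point line keeps šokolaad and žanr at the end.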
Also, a raw code point sort is badly affected by normalization. Consider:
NFC/NFD sort by codepoint is DIFFERENT
NFC Codepoint sort: äðõöüŋšžц
NFD Codepoint sort: äõöšüžðŋц
NFC/NFD sort in estonian is SAME
NFC Estonian sort: ðŋšžõäöüц
NFD Estonian sort: ðŋšžõäöüц
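The reason is that in NFD each accented letter is stored as a base letter followed by a combining mark, so a code point sort ends up ordering mostly by the base letters a, o, s, u, z. A small sketch of that for a single letter, with code_points as a made-up helper name:

```perl
use v5.14;
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use Unicode::Normalize;
use Unicode::Collate::Locale;

my $collator = Unicode::Collate::Locale->new(locale => "et");

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($str) = @_;
    return join " ", map { sprintf "U+%04X", ord } split //, $str;
}

my $nfc = NFC("ä");    # one precomposed code point
my $nfd = NFD("ä");    # base letter plus combining diaeresis

say "NFC: ", code_points($nfc);
say "NFD: ", code_points($nfd);

# The string operator sees two different code point sequences ...
say "eq operator:  ", ($nfc eq $nfd ? "same" : "different");

# ... while the collator normalizes its input before comparing.
say "collator->eq: ", ($collator->eq($nfc, $nfd) ? "same" : "different");
```

The eq operator reports the two forms as different, and the collator's eq method treats them as the same string, which is why the Estonian sort above is unaffected by the normalization form.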
And here is the demo program that produced all that.
#!/usr/bin/env perl
#
# et-demo - show how to handle Estonian collation correctly
#
# Tom Christiansen
# Fri Feb 22 19:27:51 MST 2013

use v5.14;
use utf8;
use strict;
use warnings;
use warnings FATAL => "utf8";
use open qw(:std :utf8);

use Unicode::Normalize;
use Unicode::Collate::Locale;

main();
exit();

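# Split a string into extended grapheme clusters (user-perceived characters).
sub graphemes(_) {
    my($str) = @_;
    my @graphs = $str =~ /\X/g;
    return @graphs;
}

# Report whether two strings are equal once both are normalized to NFC.
sub same_diff($$) {
    my($s1, $s2) = @_;
    no locale;
    if (NFC($s1) eq NFC($s2)) {
        return "SAME";
    } else {
        return "DIFFERENT";
    }
}

sub stringy {
    return join("" => @_);
}

# Plain sort by code point.
sub cp_sort {
    no locale;
    return sort @_;
}

# Sort using the CLDR tailoring for Estonian.
sub et_sort {
    state $collator = # we want Estonian here:
        Unicode::Collate::Locale->new(locale => "et");
    return $collator->sort(@_);
}
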
sub main {
    my $orig = "õäöüšž ðŋц";

    say " Codepoint sort: ", cp_sort(graphemes($orig));
    say " Estonian sort: ", et_sort(graphemes($orig));

    my $nfc = NFC($orig);
    my $nfc_cp_sort = stringy cp_sort(graphemes($nfc));
    my $nfc_et_sort = stringy et_sort(graphemes($nfc));

    my $nfd = NFD($orig);
    my $nfd_cp_sort = stringy cp_sort(graphemes($nfd));
    my $nfd_et_sort = stringy et_sort(graphemes($nfd));

    say "NFC/NFD sort by codepoint is ",
        same_diff($nfc_cp_sort, $nfd_cp_sort);
    say "NFC Codepoint sort: ", $nfc_cp_sort;
    say "NFD Codepoint sort: ", $nfd_cp_sort;

    say "NFC/NFD sort in estonian is ",
        same_diff($nfc_et_sort, $nfd_et_sort);
    say "NFC Estonian sort: ", $nfc_et_sort;
    say "NFD Estonian sort: ", $nfd_et_sort;
}
That really is how you should be handling locale collation. See also this answer for numerous examples.