How to remove diacritics in Perl 6

Submitted by 霸气de小男生 on 2019-12-05 06:16:18

Perl 6 has great Unicode processing support in the Str class. To do what you are asking in (1), you can use the samemark method/routine.

Per the documentation:

multi sub samemark(Str:D $string, Str:D $pattern --> Str:D)
method    samemark(Str:D: Str:D $pattern --> Str:D)

Returns a copy of $string with the mark/accent information for each character changed such that it matches the mark/accent of the corresponding character in $pattern. If $string is longer than $pattern, the remaining characters in $string receive the same mark/accent as the last character in $pattern. If $pattern is empty no changes will be made.

Examples:

say 'åäö'.samemark('aäo');                        # OUTPUT: «aäo␤» 
say 'åäö'.samemark('a');                          # OUTPUT: «aao␤» 

say samemark('Pêrl', 'a');                        # OUTPUT: «Perl␤» 
say samemark('aöä', '');                          # OUTPUT: «aöä␤» 

This can be used both to remove marks/diacritics from letters, as well as to add them.
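For the removal case specifically, here is a minimal sketch. It relies on the rule quoted above: with a single unaccented character as the pattern, every character in the string ends up with that character's marks, i.e. none. The sample sentence and variable names are only for illustration.

```raku
# Strip marks by copying the (absent) marks of a plain 'a' onto every character.
my $text  = "Él está un pingüino";
my $plain = $text.samemark('a');
say $plain;                                       # OUTPUT: «El esta un pinguino␤»
```

Any unaccented character should work as the pattern; 'a' is just a convenient choice.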

For (2), there are a few ways to do this (TIMTOWTDI). If you want all the codepoints in a string, you can use the ords method, which returns a sequence (a Seq) of the codepoint numbers.

say "p̄".ords;                  # OUTPUT: «(112 772)␤»

You can use the uniname method/routine to get the Unicode name for a codepoint:

.uniname.say for "p̄".ords;     # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

or just use the uninames method/routine:

.say for "p̄".uninames;         # OUTPUT: «LATIN SMALL LETTER P␤COMBINING MACRON␤»

If you just want the number of codepoints in the string, you can use codes:

say "p̄".codes;                 # OUTPUT: «2␤»

This is different from chars, which counts the number of characters (graphemes) in the string, so a base letter plus its combining marks counts as one:

say "p̄".chars;                 # OUTPUT: «1␤»
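One thing to keep in mind when comparing the two counts: Perl 6 keeps strings in a composed (NFC-based) form, so a letter that has a precomposed codepoint, such as ü, counts as a single codepoint under codes. A small sketch, with a sample word chosen only for illustration, that asks for NFD explicitly to get the decomposed count:

```raku
my $word = "pingüino";
say $word.chars;       # OUTPUT: «8␤»  (graphemes)
say $word.codes;       # OUTPUT: «8␤»  (ü is one precomposed codepoint)
say $word.NFD.codes;   # OUTPUT: «9␤»  (u and COMBINING DIAERESIS counted separately)
```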

Also see @hobbs' answer using NFD.

This is the best I was able to come up with from the docs; there might be a simpler way, but I'm not sure.

my $in = "Él está un pingüino";
my $stripped = Uni.new($in.NFD.grep: { !uniprop($_, 'Grapheme_Extend') }).Str;
say $stripped; # El esta un pinguino

The .NFD method converts the string to Normalization Form D (decomposed), which separates graphemes into base codepoints and combining codepoints wherever possible. The grep then keeps only those codepoints that don't have the "Grapheme_Extend" property, i.e. it removes the combining codepoints. Finally, Uni.new(...).Str assembles the remaining codepoints back into a string.
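To make the three steps easier to follow, here is the same pipeline broken apart on a single character. This is only a sketch; the intermediate arrays @decomposed and @kept are names introduced here for illustration.

```raku
# Step 1: decompose "é" into its base letter and its combining mark.
my @decomposed = "é".NFD.list;
say @decomposed.join(' ');            # OUTPUT: «101 769␤»

# Step 2: drop the codepoints that carry the Grapheme_Extend property.
my @kept = @decomposed.grep({ !uniprop($_, 'Grapheme_Extend') });
say @kept.join(' ');                  # OUTPUT: «101␤»

# Step 3: turn the remaining codepoints back into a string.
say Uni.new(@kept).Str;               # OUTPUT: «e␤»
```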

You can also put these pieces together to answer your second question; e.g.:

$in.NFD.map: { Uni.new($_).Str }

will return a list of 1-character strings, each with a single decomposed codepoint, or

$in.NFD.map(&uniname).join("\n")

will make a nice little Unicode debugger.

I can't say this is better or faster, but I strip diacritics in this way:

my $s = "åäö";
say $s.comb.map({.NFD[0].chr}).join; # output: "aao"