Question
I'm looking for a way to match only fully composed characters in a Unicode string.
Is [:print:] dependent upon locale in any regular expression implementation that incorporates this character class? For example, will it match the Japanese character 'あ', since it is not a control character, or is [:print:] always going to be ASCII codes 0x20 to 0x7E?
Is there any character class, including Perl REs, that can be used to match anything other than a control character? If [:print:] includes only characters in the ASCII range, I would assume [:cntrl:] does too.
Answer 1:
echo あ| perl -nle 'BEGIN{binmode STDIN,":utf8"} print"[$_]"; print /[[:print:]]/ ? "YES" : "NO"'
This mostly works, though it generates a warning about a wide character, because STDOUT has no encoding layer. But it gives you the idea: you must be sure you're dealing with a real Unicode string (check utf8::is_utf8). Or just read perlunicode; the whole subject still makes my head spin.
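As a rough sketch of the same check without the warning, the script below puts an encoding layer on STDOUT and uses a source literal instead of STDIN. The helper name `has_printable` is made up for illustration, and the script assumes the file is saved as UTF-8.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use utf8;    # the literal below is UTF-8 in the source file

# Give STDOUT an encoding layer so printing non-ASCII characters
# does not trigger the "Wide character" warning.
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: true if the string contains a printable character.
sub has_printable {
    my ($text) = @_;
    return $text =~ /[[:print:]]/ ? 1 : 0;
}

my $char = "あ";
printf "[%s] decoded=%s printable=%s\n",
    $char,
    utf8::is_utf8($char) ? 'yes' : 'no',
    has_printable($char) ? 'YES' : 'NO';
```

For the one-liner form, the -CS switch described in perlrun is another way to put UTF-8 layers on the standard streams.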
Answer 2:
I think you don't want or need locales for that, but rather Unicode. If you have decoded a text string, \w will match word characters in any language, \d matches not just 0..9 but every Unicode digit, etc. In regexes you can query Unicode properties with \p{PropertyName}. Particularly interesting for you might be \p{Print}. Here's a list of all the available Unicode character properties.
I wrote an article about the basics and subtleties of Unicode and Perl; it should give you a good idea of what to do so that perl will recognize your string as a sequence of characters, not just a sequence of bytes.
Update: with Unicode you don't get language-dependent behaviour, but instead sane defaults regardless of language. This may or may not be what you want, but for the distinction between printable and control characters I don't see why you'd need language-dependent behaviour.
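A minimal sketch of what that looks like on decoded strings: the sample characters are written as code point escapes so the result does not depend on the source file's encoding, and the helper name `classify` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: report which classes a single character falls into.
sub classify {
    my ($ch) = @_;
    return +{
        printable => ($ch =~ /\A\p{Print}\z/ ? 1 : 0),
        digit     => ($ch =~ /\A\d\z/        ? 1 : 0),
        word      => ($ch =~ /\A\w\z/        ? 1 : 0),
    };
}

my @samples = (
    [ 'HIRAGANA LETTER A',        "\x{3042}" ],
    [ 'ARABIC-INDIC DIGIT THREE', "\x{0663}" ],
    [ 'BELL (control)',           "\x{0007}" ],
);

for my $sample (@samples) {
    my ($name, $ch) = @$sample;
    my $c = classify($ch);
    printf "%-26s printable=%d digit=%d word=%d\n",
        $name, @{$c}{qw(printable digit word)};
}
```

The hiragana letter is printable and a word character, the Arabic-Indic digit also matches \d, and the control character matches none of the three.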
Answer 3:
\X matches a fully-composed character (sequence). Proof:
#!/usr/bin/env perl
use 5.010;
use utf8;
use Encode qw(encode_utf8);
for my $string (qw(あ ご ご), "\x{3099}") {
    say encode_utf8 sprintf "%s $string", $string =~ /\A \X \z/msx ? 'ok' : 'nok';
}
The test data are: a normal character, a pre-combined character, a combining character sequence, and a combining character (which "doesn't count" on its own, a simplification of Chapter 3 of the Unicode Standard).
Substitute \X with [[:print:]] to see that Tanktalus' answer produces false matches for the last two cases.
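To illustrate how \X groups code points, here is a small sketch that splits a decoded string into clusters and compares the cluster count with the code point count. The helper name `clusters` is hypothetical. The lone combining mark case is deliberately left out, since how \X treats it may depend on the Perl version.

```perl
use strict;
use warnings;

# Hypothetical helper: split a decoded string into \X clusters.
sub clusters {
    my ($text) = @_;
    my @found = $text =~ /\X/g;    # list context: every match
    return @found;
}

my $precomposed = "\x{3054}";            # pre-combined GO
my $sequence    = "\x{3053}\x{3099}";    # KO + combining voiced sound mark

for my $text ($precomposed, $sequence, "\x{3042}$sequence") {
    my @found = clusters($text);
    printf "code points=%d clusters=%d\n", length($text), scalar @found;
}
```

The combining sequence is two code points but a single cluster, which is the distinction [[:print:]] cannot make.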
Answer 4:
Yes, those expressions are locale dependent.
Answer 5:
You could always use the character class [^[:cntrl:]] to match non-control characters.
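A minimal sketch of that class applied to decoded strings follows; the helper name `is_control_free` is an assumption for illustration. Note that a lone combining mark is not a control character, so this check says nothing about whether characters are fully composed.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has no control characters.
sub is_control_free {
    my ($text) = @_;
    return $text =~ /\A[^[:cntrl:]]*\z/ ? 1 : 0;
}

my @samples = (
    [ 'hiragana letter'     => "\x{3042}" ],
    [ 'text with a tab'     => "a\tb" ],
    [ 'lone combining mark' => "\x{3099}" ],
);

for my $sample (@samples) {
    my ($name, $text) = @$sample;
    printf "%-20s control-free=%d\n", $name, is_control_free($text);
}
```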
Source: https://stackoverflow.com/questions/203605/how-do-i-match-only-fully-composed-characters-in-a-unicode-string-in-perl