I wonder why most modern solutions built using Perl don\'t enable UTF-8 by default.
I understand there are many legacy problems for core Perl scripts, where it may b
While reading this thread, I often get the impression that people are using "UTF-8" as a synonym to "Unicode". Please make a distinction between Unicode's "Code-Points" which are an enlarged relative of the ASCII code and Unicode's various "encodings". And there are a few of them, of which UTF-8, UTF-16 and UTF-32 are the current ones and a few more are obsolete.
Please, UTF-8 (as well as all other encodings) exists and have meaning in input or in output only. Internally, since Perl 5.8.1, all strings are kept as Unicode "Code-points". True, you have to enable some features as admiringly covered previously.
There's a truly horrifying amount of ancient code out there in the wild, much of it in the form of common CPAN modules. I've found I have to be fairly careful enabling Unicode if I use external modules that might be affected by it, and am still trying to identify and fix some Unicode-related failures in several Perl scripts I use regularly (in particular, iTiVo fails badly on anything that's not 7-bit ASCII due to transcoding issues).
You should enable the unicode strings feature, and this is the default if you use v5.14;
You should not really use unicode identifiers esp. for foreign code via utf8 as they are insecure in perl5, only cperl got that right. See e.g. http://perl11.org/blog/unicode-identifiers.html
Regarding utf8 for your filehandles/streams: You need decide by yourself the encoding of your external data. A library cannot know that, and since not even libc supports utf8, proper utf8 data is rare. There's more wtf8, the windows aberration of utf8 around.
BTW: Moose is not really "Modern Perl", they just hijacked the name. Moose is perfect Larry Wall-style postmodern perl mixed with Bjarne Stroustrup-style everything goes, with an eclectic aberration of proper perl6 syntax, e.g. using strings for variable names, horrible fields syntax, and a very immature naive implementation which is 10x slower than a proper implementation. cperl and perl6 are the true modern perls, where form follows function, and the implementation is reduced and optimized.
There are two stages to processing Unicode text. The first is "how can I input it and output it without losing information". The second is "how do I treat text according to local language conventions".
tchrist's post covers both, but the second part is where 99% of the text in his post comes from. Most programs don't even handle I/O correctly, so it's important to understand that before you even begin to worry about normalization and collation.
This post aims to solve that first problem
When you read data into Perl, it doesn't care what encoding it is. It allocates some memory and stashes the bytes away there. If you say print $str
, it just blits those bytes out to your terminal, which is probably set to assume everything that is written to it is UTF-8, and your text shows up.
Marvelous.
Except, it's not. If you try to treat the data as text, you'll see that Something Bad is happening. You need go no further than length
to see that what Perl thinks about your string and what you think about your string disagree. Write a one-liner like: perl -E 'while(<>){ chomp; say length }'
and type in 文字化け
and you get 12... not the correct answer, 4.
That's because Perl assumes your string is not text. You have to tell it that it's text before it will give you the right answer.
That's easy enough; the Encode module has the functions to do that. The generic entry point is Encode::decode
(or use Encode qw(decode)
, of course). That function takes some string from the outside world (what we'll call "octets", a fancy of way of saying "8-bit bytes"), and turns it into some text that Perl will understand. The first argument is a character encoding name, like "UTF-8" or "ASCII" or "EUC-JP". The second argument is the string. The return value is the Perl scalar containing the text.
(There is also Encode::decode_utf8
, which assumes UTF-8 for the encoding.)
If we rewrite our one-liner:
perl -MEncode=decode -E 'while(<>){ chomp; say length decode("UTF-8", $_) }'
We type in 文字化け and get "4" as the result. Success.
That, right there, is the solution to 99% of Unicode problems in Perl.
The key is, whenever any text comes into your program, you must decode it. The Internet cannot transmit characters. Files cannot store characters. There are no characters in your database. There are only octets, and you can't treat octets as characters in Perl. You must decode the encoded octets into Perl characters with the Encode module.
The other half of the problem is getting data out of your program. That's easy to; you just say use Encode qw(encode)
, decide what the encoding your data will be in (UTF-8 to terminals that understand UTF-8, UTF-16 for files on Windows, etc.), and then output the result of encode($encoding, $data)
instead of just outputting $data
.
This operation converts Perl's characters, which is what your program operates on, to octets that can be used by the outside world. It would be a lot easier if we could just send characters over the Internet or to our terminals, but we can't: octets only. So we have to convert characters to octets, otherwise the results are undefined.
To summarize: encode all outputs and decode all inputs.
Now we'll talk about three issues that make this a little challenging. The first is libraries. Do they handle text correctly? The answer is... they try. If you download a web page, LWP will give you your result back as text. If you call the right method on the result, that is (and that happens to be decoded_content
, not content
, which is just the octet stream that it got from the server.) Database drivers can be flaky; if you use DBD::SQLite with just Perl, it will work out, but if some other tool has put text stored as some encoding other than UTF-8 in your database... well... it's not going to be handled correctly until you write code to handle it correctly.
Outputting data is usually easier, but if you see "wide character in print", then you know you're messing up the encoding somewhere. That warning means "hey, you're trying to leak Perl characters to the outside world and that doesn't make any sense". Your program appears to work (because the other end usually handles the raw Perl characters correctly), but it is very broken and could stop working at any moment. Fix it with an explicit Encode::encode
!
The second problem is UTF-8 encoded source code. Unless you say use utf8
at the top of each file, Perl will not assume that your source code is UTF-8. This means that each time you say something like my $var = 'ほげ'
, you're injecting garbage into your program that will totally break everything horribly. You don't have to "use utf8", but if you don't, you must not use any non-ASCII characters in your program.
The third problem is how Perl handles The Past. A long time ago, there was no such thing as Unicode, and Perl assumed that everything was Latin-1 text or binary. So when data comes into your program and you start treating it as text, Perl treats each octet as a Latin-1 character. That's why, when we asked for the length of "文字化け", we got 12. Perl assumed that we were operating on the Latin-1 string "æååã" (which is 12 characters, some of which are non-printing).
This is called an "implicit upgrade", and it's a perfectly reasonable thing to do, but it's not what you want if your text is not Latin-1. That's why it's critical to explicitly decode input: if you don't do it, Perl will, and it might do it wrong.
People run into trouble where half their data is a proper character string, and some is still binary. Perl will interpret the part that's still binary as though it's Latin-1 text and then combine it with the correct character data. This will make it look like handling your characters correctly broke your program, but in reality, you just haven't fixed it enough.
Here's an example: you have a program that reads a UTF-8-encoded text file, you tack on a Unicode PILE OF POO
to each line, and you print it out. You write it like:
while(<>){
chomp;
say "$_
We're all in agreement that it is a difficult problem for many reasons, but that's precisely the reason to try to make it easier on everybody.
There is a recent module on CPAN, utf8::all, that attempts to "turn on Unicode. All of it".
As has been pointed out, you can't magically make the entire system (outside programs, external web requests, etc.) use Unicode as well, but we can work together to make sensible tools that make doing common problems easier. That's the reason that we're programmers.
If utf8::all doesn't do something you think it should, let's improve it to make it better. Or let's make additional tools that together can suit people's varying needs as well as possible.
`