perl Encode::Guess with and without hints - detecting utf8

[亡魂溺海] submitted on 2019-12-11 12:48:12

Question


I am confused about Encode::Guess. Suppose this is my Perl code:

use strict; 
use warnings;
use 5.18.2;
use Encode;
use Encode::Guess qw/utf8 iso-8859-1/;
use open IO => ':encoding(UTF-8)', ':std';
my $str1 = "1 = educa\x{c3}\x{a7}\x{c3}\x{a3}o";
my $str2 =  "2 = educa\x{e7}\x{e3}o";

say "A: " . fixEnc($str1);
say "B: " . fixEnc($str1, 'hint');
say "C: " . fixEnc($str2);
say "D: " . fixEnc($str2, 'hint');
say "";

# Guess the encoding of $data and decode it. A true second argument adds
# utf8 and iso-8859-1 as suspects for that call only.
sub fixEnc {
    my ($data, $hint) = @_;
    my $enc = $hint
        ? guess_encoding($data, qw/utf8 iso-8859-1/)
        : guess_encoding($data);
    # On failure guess_encoding returns an error string, not an object.
    if (!ref($enc)) {
        return "ERROR: Can't guess: $enc for $data";
    }
    my $utf8 = decode($enc->name, $data);
    return "encoding guess: " . $enc->name . "; result: $utf8";
}

It produces:

A1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
B1: ERROR: Can't guess: utf8 or iso-8859-1 for 1 = educaÃ§Ã£o
C1: encoding guess: iso-8859-1; result: 2 = educação
D1: encoding guess: iso-8859-1; result: 2 = educação

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' with 'use Encode::Guess;', I get:

A2: encoding guess: utf8; result: 1 = educação
B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o
C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação
D2: encoding guess: iso-8859-1; result: 2 = educação

What causes the difference? In particular, why is utf8 not guessed when I hint with utf8?

Edit: I have posted an answer below. Basically, the realisation is that Guess goes by character encodings and doesn't speak Portuguese! 'educaÃ§Ã£o', while not Portuguese, is the correct latin-1 reading of string 1 above, and Guess (unlike a Portuguese speaker) cannot distinguish it from the UTF-8 reading 'educação'.


Answer 1:


I think this is what's going on. With use Encode::Guess qw/utf8 iso-8859-1/; the 'hint' makes no difference (sorry for being unclear!), so we only have

A1/B1: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

and C1/D1: encoding guess: iso-8859-1; result: 2 = educação

For A1/B1, the string could be UTF-8 (educação) or it could be latin-1 (educaÃ§Ã£o). The second one looks incorrect, but Encode::Guess cannot tell: Guess goes by character encodings and doesn't speak Portuguese!
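The ambiguity can be checked directly. The following is a minimal sketch (not part of the original post) that decodes the bytes of string 1 under both encodings with Encode's FB_CROAK check, which dies on malformed input. Both decodes succeed, so there is nothing at the byte level to rule either encoding out.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Raw bytes of string 1 from the question.
my $bytes = "educa\xc3\xa7\xc3\xa3o";

# FB_CROAK makes decode() die on malformed input. decode() may modify its
# source argument when a CHECK value is given, so work on copies.
my ($copy1, $copy2) = ($bytes, $bytes);
my $as_utf8   = decode('UTF-8',      $copy1, FB_CROAK);
my $as_latin1 = decode('iso-8859-1', $copy2, FB_CROAK);

# Both calls succeed; only the number of resulting characters differs.
printf "as UTF-8:      %d characters\n", length $as_utf8;    # 8
printf "as iso-8859-1: %d characters\n", length $as_latin1;  # 10
```

Read as UTF-8, the two-byte pairs collapse into 'ç' and 'ã'; read as iso-8859-1, every byte becomes its own character.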

Now if I replace 'use Encode::Guess qw/utf8 iso-8859-1/;' by 'use Encode::Guess;' I get

A2: encoding guess: utf8; result: 1 = educação

latin-1 is no longer an option (it's not part of the default suspects, which are only ascii, utf8 and UTF-16/32 with BOM), so the result comes out as utf8.

B2: ERROR: Can't guess: iso-8859-1 or utf8 for 1 = educaÃ§Ã£o

In B2, with the hint, we're back in the above scenario, and Guess cannot decide.

For C2:

C2: ERROR: Can't guess: No appropriate encodings found! for 2 = educação

this makes sense, as latin-1 isn't part of the default. Finally in D2

D2: encoding guess: iso-8859-1; result: 2 = educação

latin-1 is hinted, so the encoding is detected.
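The second run can be reduced to a small sketch, assuming a fresh script that loads Encode::Guess without an import list. The describe helper is a hypothetical name used only for printing. The D2-style call here passes only iso-8859-1, since utf8 is already among the defaults.

```perl
use strict;
use warnings;
use Encode::Guess;    # no import list: default suspects only

my $str1 = "educa\xc3\xa7\xc3\xa3o";    # UTF-8 bytes
my $str2 = "educa\xe7\xe3o";            # iso-8859-1 bytes

# guess_encoding returns an encoding object on success
# and a plain error string on failure.
sub describe {    # hypothetical helper, for printing only
    my ($guess) = @_;
    return ref $guess ? $guess->name : "no guess ($guess)";
}

my $a2 = guess_encoding($str1);                  # as in A2
my $c2 = guess_encoding($str2);                  # as in C2
my $d2 = guess_encoding($str2, 'iso-8859-1');    # comparable to D2

print "A2: ", describe($a2), "\n";
print "C2: ", describe($c2), "\n";
print "D2: ", describe($d2), "\n";
```

Suspects passed to guess_encoding apply to that call only, while an import list on the use line changes the defaults for every later call.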




Answer 2:


It's hard to say for sure because there are a few issues at work that make detecting the encoding difficult.

First is the fact that every byte sequence is valid iso-8859-1, so any input that is valid utf8 also decodes as iso-8859-1. Unless there's a definitive byte-order mark at the start of the string or a byte sequence that is not valid utf8, then Encode::Guess really is just guessing.

Second is mentioned in the Encode::Guess caveats in the perldoc. Encode::Guess runs through the text using a 'trial-and-error' algorithm to eliminate all but one of the provided encodings. Naturally, the more alike two encodings are, the less accurate the module will be.
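The elimination idea can be illustrated with a simplified sketch. This is not the module's actual code: surviving_encodings is a hypothetical helper that tries each candidate with FB_CROAK and keeps the ones that decode without error. A guess is only possible when exactly one candidate survives.

```perl
use strict;
use warnings;
use Encode qw(find_encoding FB_CROAK);

# Hypothetical helper: return the names of the candidate encodings under
# which $bytes decodes without error, in the order they were given.
sub surviving_encodings {
    my ($bytes, @candidates) = @_;
    my @ok;
    for my $name (@candidates) {
        my $enc = find_encoding($name) or die "Unknown encoding: $name";
        my $copy = $bytes;    # decode() may consume its source argument
        push @ok, $enc->name if eval { $enc->decode($copy, FB_CROAK); 1 };
    }
    return @ok;
}

my @candidates = qw(ascii utf8 iso-8859-1);
my @for_str1 = surviving_encodings("educa\xc3\xa7\xc3\xa3o", @candidates);
my @for_str2 = surviving_encodings("educa\xe7\xe3o",         @candidates);

print "string 1: @for_str1\n";    # two survivors, so no unique answer
print "string 2: @for_str2\n";    # one survivor
```

String 1 survives as both utf8 and iso-8859-1, which corresponds to the "utf8 or iso-8859-1" error in the question, while string 2 survives only as iso-8859-1.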

Third, when you don't specify the allowed encoding types in the use statement, the module will compare it to everything it can. This combined with the trial-and-error approach and the overlap in utf8 vs iso-8859-1 code points means it's possible for Encode::Guess to hit different conclusions based on the parameters passed to the method. I imagine you would get more consistent results if you checked against two more divergent encodings, like utf8 vs 7bit-jis.

Lastly, Perl has more than one implementation of utf8 so it's also possible that when you don't specify the 'utf8' encoding explicitly, it might be using a different implementation that may change the results as well. I don't know enough about Perl's internals to confirm that's what's happening in this case though.
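The two implementations can be told apart by the canonical names Encode reports. A small sketch, which only shows that the two labels resolve to different encoding objects and makes no claim about whether this affects the results in the question:

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

# 'utf8' is Perl's lax internal variant; 'UTF-8' is the strict one.
for my $label (qw(utf8 UTF-8)) {
    my $enc = find_encoding($label) or die "Unknown encoding: $label";
    printf "%-5s resolves to %s\n", $label, $enc->name;
}
```

The first label resolves to utf8 and the second to utf-8-strict. The question's code spells the suspect as utf8 in both the use line and the hinted call, so both paths name the same label.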



Source: https://stackoverflow.com/questions/53146664/perl-encodeguess-with-and-without-hints-detecting-utf8
