Perl: utf8::decode vs. Encode::decode

久未见 提交于 2019-12-03 13:20:40
daxim

You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:

Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.

Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level, it does not even give you error-checking.

You are going wrong with the assumption that the octects E8 AB 86 0A are the result of UTF-8 double-encoding the characters and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.

length is unappropriately overloaded, at certain times it determines the length in characters, or the length in octets. Use better tools such as Devel::Peek.

#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);

my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter

Dump $test;
#  FLAGS = (PADMY,POK,pPOK)
#  PV = 0x8d8520 "\350\253\206\n"\0

$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
#  FLAGS = (PADMY,POK,pPOK,UTF8)
#  PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!