问题
Lets say i have this code:
use strict;
use LWP qw ( get );
my $content = get ( "http://www.msn.co.il" );
print STDERR $content;
The error log shows something like "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" which i'm guessing it's utf-16 ?
The website's encoding is with
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1255">
so why these characters appear and not the windows-1255 chars ?
And, another weird thing is that i have two servers:
the first server returning CP1255 chars and i can simply convert it to utf8, and the current server gives me these chars and i can't do anything with it ...
is there any configuration file in apache/perl/module that is messing up the encoding ? forcing something ... ?
The result in my website at the second server, is that the perl file and the headers are all utf8, so when i write text that aren't english chars, the content from the example above is showing ok ( even though it's weird utf chars ) but my own static text are look like "×ס'××ר××:"
One more thing that i tested is ...
Through perl:
my $content = `curl "http://www.anglo-saxon.co.il"`;
I get utf8 encoding.
Through Bash:
curl "http://www.anglo-saxon.co.il"
and here i get CP1255 ( Windows-1255 ) encoding ...
Also, when i run the script in bash - it gives CP1255, and when run it through the web - then it's utf8 again ...
fixed the problem by changin the content from utf8 - to what is supposed to, and then back to utf8:
use Text::Iconv;
my $converter = Text::Iconv->new("utf8", "CP1255");
$content=$converter->convert($content);
my $converter = Text::Iconv->new("CP1255", "utf8");
$content=$converter->convert($content);
回答1:
The string with the hex values that you gave appears to be a UTF-8 encoding. You are getting this because Perl ‘likes to’ use UTF-8 when it deals with strings. The LWP::Simple->get()
method automatically decodes the content from the server which includes undoing any Content-Encoding as well as converting to UTF-8.
You could dig into the internals and get a version that does change the character encoding (see HTTP::Message's decoded_content, which is used by HTTP::Response's decoded_content, which you can get from LWP::UserAgent's get). But it may be easier to re-encode the the data in your desired encoding with something like
use Encode;
...;
$cp1255_bytes = encode('CP1255', decode('UTF_8', $utf8_bytes));
The mixed readable/garbage characters you see are due to mixing multiple, incompatible encodings in the same stream. Probably the stream is labeled as UTF-8 but you are putting CP1255 encoded characters into it. You either need to label the stream as CP1255 and put only CP1255-encoded data into it, or label it as UTF-8 and put only UTF-8-encoded data into it. Remind yourself that bytes are not characters and convert between them appropriately.
回答2:
All of this manual encoding and decoding is unnecessary. The HTML is lying to you when it says that the page is encoded in windows-1255; the server says it's serving UTF-8, and it is. Blame Microsoft HTML-generation tools.
Anyway, since the server does return the correct encoding, this works:
my $response = LWP::UserAgent->new->get("http://www.msn.co.il/");
my $content = $res->decoded_content;
$content
is now a perl character string, ready to do whatever you need. If you want to convert it to some other encoding, then calling Encode::encode
on it is appropriate; do not use Encode::decode
as it's already been decoded once.
回答3:
http://www.msn.co.il is in UTF-8, and indicates that properly. The string "\xd7\x9c\xd7\x94\xd7\x93\xd7\xa4\xd7\xa1\xd7\x94" is also proper UTF-8 (להדפסה). I don't see the problem.
I think your second problem is due to you mixing different encodings (UTF-8 and Windows-1252). You might want to encode/decode your strings properly.
回答4:
First, note that you should import get
from LWP::Simple. Second, everything works fine with:
#!/usr/bin/perl
use strict; use warnings;
use LWP::Simple qw ( getstore );
getstore 'http://www.msn.co.il', 'test.html';
which indicates to me that the problem is the encoding of the filehandle to which you are sending the output.
来源:https://stackoverflow.com/questions/2341128/why-does-perls-lwp-gives-me-a-different-encoding-than-the-original-website