Question
I use TIdHTTP to fetch web content. The response header indicates that the content is encoded as UTF-8. I want to print the content in a console that uses CP936 (Simplified Chinese), but the decoded text is not readable.
Result := TEncoding.Utf8.GetString(ResponseBuffer);
I do the same thing in python (using httplib2) without any problems.
import httplib2

def python_try():
    conn = httplib2.Http()
    response, content = conn.request(...)
    print(content.decode('utf8'))  # readable in console
UPDATE 1
I debugged the raw response and noticed that the content is gzipped.
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: text/html;charset=UTF-8
Transfer-Encoding: chunked
Content-Encoding: gzip
Vary: Accept-Encoding
Date: Mon, 24 Dec 2012 15:27:44 GMT
Connection: Keep-Alive
I tried assigning a TIdCompressorZLib instance to the TIdHTTP instance. Unfortunately, the application crashes while decompressing the gzipped content. The test address is "http://www.baidu.com" (encoding=gb2312).
UPDATE 2
I also tried downloading a gzipped jQuery script file, which contains only ASCII characters. This time it worked, which suggests a problem in the Indy library. If I am not mistaken, I should close the question.
Answer 1:
TIdHTTP handles the gzip decompression for you if you have a TIdCompressorZLib component assigned to the TIdHTTP.Compressor property. Otherwise, you will have to decompress it manually (TIdHTTP will not send an Accept-Encoding header by default if the Compressor property is not assigned).
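To illustrate what decompressing manually involves, here is a minimal Python sketch, not Indy code. It assumes the raw body bytes and the Content-Encoding value are already in hand; the helper name decode_body and its parameters are hypothetical. The point it shows is the order of operations: undo the gzip layer first, then decode the bytes as text.

```python
import gzip


def decode_body(raw: bytes, content_encoding: str = "", charset: str = "utf-8") -> str:
    """Undo the Content-Encoding layer first, then decode the bytes to text."""
    if content_encoding.strip().lower() == "gzip":
        raw = gzip.decompress(raw)
    return raw.decode(charset)


# Simulate a gzipped UTF-8 body similar to the response in the question.
body = gzip.compress("你好, world".encode("utf-8"))
text = decode_body(body, content_encoding="gzip")
print(text == "你好, world")
```

Decoding the still-compressed bytes as UTF-8 would fail or yield unreadable text, which matches the symptom described in the question.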
As for the UTF-8 encoding, TIdHTTP handles that for you as well if you call the overloaded version of the TIdHTTP.Get() or TIdHTTP.Post() method that returns a String value instead of filling a TStream object. It will decode the UTF-8 to UTF-16 for you. To convert that to CP936, you can let the RTL do the conversion:
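type
  // AnsiString bound to code page 936 (Simplified Chinese)
  Cp936String = type AnsiString(936);
var
  S: Cp936String;
begin
  // Get() returns a UTF-16 string; the cast lets the RTL convert it to CP936
  S := Cp936String(IdHTTP1.Get(...));
end;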
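For comparison, the same two-step conversion (UTF-8 bytes to text, then text to CP936 bytes) can be sketched in Python. The function name utf8_to_cp936 is hypothetical, and replacing unencodable characters with "?" is an assumption made here, not something the Delphi cast is documented to do in this thread.

```python
def utf8_to_cp936(raw: bytes) -> bytes:
    """Decode UTF-8 bytes to text, then re-encode the text as CP936."""
    text = raw.decode("utf-8")
    # errors="replace" is an assumption: characters that CP936 cannot
    # represent become "?" instead of raising UnicodeEncodeError.
    return text.encode("cp936", errors="replace")


utf8_bytes = "中文".encode("utf-8")   # 6 bytes in UTF-8
cp936_bytes = utf8_to_cp936(utf8_bytes)
print(cp936_bytes.hex())
```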
Answer 2:
Do not rely on automatic encoding detection; it cannot be done reliably. Simply trust the Content-Type header.
Result := TEncoding.Utf8.GetString(ResponseBuffer);
If the Content-Type header is missing or wrong, then you need to detect the encoding yourself. Even then, I would not use any algorithm that could misdetect UTF-8 as CP936...
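A minimal Python sketch of trusting the Content-Type header is shown here. The helper name charset_from_content_type is hypothetical, and falling back to UTF-8 when no charset is declared is an assumption for illustration, not a rule stated in this thread.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset parameter of a Content-Type header, or a default."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None
    # when the header carries no charset parameter.
    return msg.get_content_charset() or default


# Header value taken from the raw response in the question.
charset = charset_from_content_type("text/html;charset=UTF-8")
print(charset)
```

The returned name can then be passed to bytes.decode() once any Content-Encoding such as gzip has been undone.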
Source: https://stackoverflow.com/questions/14017186/failed-to-decode-response-content-using-idhttp