Question
This Python 3 code does not give me the output I expect:
import urllib.request
fhand=urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
print(fhand.read())
Its output is:
b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'
Why did I get b'...'?
What can I do to get the right response?
The text should be:
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Answer 1:
The b'...' is a byte string: a sequence of bytes, not a text string.
To convert it to a text string, use
fhand.read().decode()
This uses the default encoding, 'UTF-8'. To decode as ASCII instead, use, for example,
fhand.read().decode("ASCII")
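As a rough illustration of the two calls above, here is a minimal sketch that needs no network access. The variable raw is a stand-in for what fhand.read() returns, and the heart character is only an assumed example of non-ASCII data, not something from the original file.

```python
# Stand-in for the bytes that fhand.read() would return.
raw = b"But soft what light through yonder window breaks\n"

text = raw.decode()         # bytes.decode() defaults to UTF-8
same = raw.decode("ASCII")  # identical here, because the line is pure ASCII

# Bytes outside the ASCII range cannot be decoded as ASCII.
non_ascii = "Juliet \u2665".encode("utf-8")
try:
    non_ascii.decode("ASCII")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(text == same, ascii_ok)  # True False
```

For plain English text either encoding works; UTF-8 is the safer choice when the content may contain other characters.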
Answer 2:
As the documentation says, urlopen returns an object whose read method gives you a sequence of bytes, not a sequence of characters. To convert the bytes to printable characters, which is what you want, apply the decode method with the encoding that the bytes are in.
The output still looks readable because Python displays bytes in the printable ASCII range as the corresponding characters, and this text happens to be plain ASCII.
To do this properly, call read().decode(encoding), where encoding is the charset given in the Content-Type HTTP header, accessible through the HTTPResponse object (that is, fhand in your code). If there is no Content-Type header, or if it doesn't specify an encoding, you're reduced to guessing which encoding to use. For typical English text the choice rarely matters, and in many other cases it's probably going to be UTF-8.
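One way to read the encoding from the Content-Type header is the get_content_charset() method of the response's headers object. The sketch below is only an illustration: decode_body and FakeResponse are hypothetical names, and FakeResponse stands in for the object returned by urlopen so that the example runs without a network connection. The UTF-8 fallback is an assumption, in line with the guess suggested above.

```python
from email.message import Message


def decode_body(response, fallback="utf-8"):
    """Decode response.read() with the charset from Content-Type, if present."""
    # HTTPResponse.headers is an email.message.Message subclass, so
    # get_content_charset() returns the charset parameter or None.
    charset = response.headers.get_content_charset() or fallback
    return response.read().decode(charset)


class FakeResponse:
    """Hypothetical stand-in for the object returned by urlopen()."""

    def __init__(self, body, content_type=None):
        self._body = body
        self.headers = Message()
        if content_type is not None:
            self.headers["Content-Type"] = content_type

    def read(self):
        return self._body


# Same word, two encodings: one declared in the header, one left to the fallback.
latin1 = FakeResponse("caf\u00e9".encode("latin-1"), "text/plain; charset=ISO-8859-1")
no_charset = FakeResponse("caf\u00e9".encode("utf-8"), "text/plain")

a = decode_body(latin1)
b = decode_body(no_charset)
print(a == b)  # True
```

With a real response, decode_body(fhand) would be called in place of the stand-in objects.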
Answer 3:
Python 3 distinguishes between byte sequences and strings. The "b" in front of the string tells you that urllib returned the contents as "raw" bytes. It might be worth reading up on the Python 3 bytes/strings situation, but basically, you did get the right text back. If you don't want the result to be bytes, you just have to convert it to a "real" Python string.
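To make the distinction concrete, here is a small sketch of how the two types behave. The sample line is copied from the question, and UTF-8 is an assumed encoding.

```python
raw = b"Arise fair sun and kill the envious moon"
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
# Indexing bytes gives an integer; indexing str gives a one-character string.
print(raw[0], text[0])                          # 65 A
# Encoding the string again yields the original bytes.
print(text.encode("utf-8") == raw)              # True
```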
Answer 4:
The third-party requests library handles decoding to Unicode strings automatically. It does its best to infer the correct encoding, so you don't need to guess the encoding yourself.
>>> import requests
>>> r = requests.get('http://www.py4inf.com/code/romeo.txt')
>>> print(r.text)
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
The same thing with urllib.request and an assumed UTF-8 encoding:
>>> from urllib.request import urlopen
>>> r = urlopen('http://www.py4inf.com/code/romeo.txt')
>>> print(r.read().decode('UTF-8'))
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
Source: https://stackoverflow.com/questions/33688837/urllib-for-python-3