urllib for python 3

Submitted by 夙愿已清 on 2019-12-23 05:16:44

Question


This code in Python 3 is problematic:

import urllib.request
fhand=urllib.request.urlopen('http://www.py4inf.com/code/romeo.txt')
print(fhand.read())

Its output is:

b'But soft what light through yonder window breaks'
b'It is the east and Juliet is the sun'
b'Arise fair sun and kill the envious moon'
b'Who is already sick and pale with grief'

Why did I get b'...'? What could I do to get the right response?

The right text should be

But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Answer 1:


The b'...' is a byte string: an array of bytes, not a real string.

To convert to a real string, use

fhand.read().decode()

This uses the default encoding, UTF-8. To decode with a different encoding, for example ASCII, use

fhand.read().decode("ASCII")
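
A minimal sketch of the difference, using a hard-coded bytes literal in place of the network response. The names raw, text and accented are only for illustration and do not come from the question:

```python
# Stand-in for what fhand.read() returns: a bytes object, not a str.
raw = b"But soft what light through yonder window breaks\n"
print(raw)            # prints the repr, including the b'...' wrapper

text = raw.decode()   # decode() defaults to UTF-8
print(text, end="")   # prints the plain text

# Non-ASCII characters show why the choice of encoding matters.
accented = "café".encode("utf-8")
print(accented)                    # b'caf\xc3\xa9'
print(accented.decode("utf-8"))    # café
try:
    accented.decode("ascii")       # the bytes 0xc3 0xa9 are not valid ASCII
except UnicodeDecodeError as err:
    print("not valid ASCII:", err.reason)
```

For pure ASCII text such as romeo.txt, both decode() calls give the same result; they only diverge once the data contains bytes outside the ASCII range.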




Answer 2:


As the documentation says, urlopen returns an object whose read method gives you a sequence of bytes, not a sequence of characters. In order to convert the bytes to printable characters, which is what you want, you will need to apply the decode method, using the encoding that the bytes are in.

The reason the result still looks readable is that Python displays a bytes object by showing every printable ASCII byte as the corresponding character (other bytes appear as \xNN escapes), and this text happens to contain only ASCII characters.

To do this properly, you should call read().decode(encoding), where encoding is the charset value from the Content-Type HTTP header, accessible through the HTTPResponse object (that is, fhand, in your code). If there is no Content-Type header, or if it doesn't specify a charset, you're reduced to guessing which encoding to use. For plain English text the guess rarely matters, because most common encodings agree on the ASCII range, and in many other cases the data is probably UTF-8.
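
One way this could look, as a sketch rather than a definitive implementation. The helper names decode_body and fetch_text are hypothetical, and the UTF-8 fallback is an assumption for the case where the server does not declare a charset:

```python
from email.message import Message
from urllib.request import urlopen


def decode_body(body: bytes, headers: Message, fallback: str = "utf-8") -> str:
    """Decode body with the charset from Content-Type, else the fallback."""
    # get_content_charset() returns None when no charset parameter is present.
    charset = headers.get_content_charset() or fallback
    return body.decode(charset)


def fetch_text(url: str) -> str:
    """Fetch url and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        # response.headers is an http.client.HTTPMessage, a Message subclass.
        return decode_body(response.read(), response.headers)
```

Assuming the URL from the question is still reachable, print(fetch_text('http://www.py4inf.com/code/romeo.txt')) would print the text without the b'...' wrapper.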




Answer 3:


Python 3 distinguishes between byte sequences and strings. The "b" in front of the string tells you that urllib returned the contents as "raw" bytes. It might be worth reading up on the Python 3 bytes/strings distinction, but basically, you did get the right text back. If you don't want the result to be bytes, you just have to decode it into a "real" Python string.
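
The same conversion can be applied line by line. The sketch below uses io.BytesIO as a stand-in for the object urlopen returns (both yield bytes lines when iterated) and assumes the data is UTF-8:

```python
import io

# Stand-in for the response object; iterating it yields one bytes line at a time.
fhand = io.BytesIO(
    b"But soft what light through yonder window breaks\n"
    b"It is the east and Juliet is the sun\n"
)

lines = []
for line in fhand:
    # Each line is bytes; decode it and drop the trailing newline.
    lines.append(line.decode("utf-8").rstrip("\n"))

print("\n".join(lines))
```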




Answer 4:


The third-party requests library handles decoding to unicode strings automatically. It does its best to infer the correct encoding so you don't need to guess the encoding yourself.

>>> import requests
>>> r = requests.get('http://www.py4inf.com/code/romeo.txt')
>>> print(r.text)
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief

Same thing with urllib.request and an assumed UTF-8 encoding:

>>> from urllib.request import urlopen
>>> r = urlopen('http://www.py4inf.com/code/romeo.txt')
>>> print(r.read().decode('UTF-8'))
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief


Source: https://stackoverflow.com/questions/33688837/urllib-for-python-3
