How can I download and read a URL with universal newlines?

Submitted by 僤鯓⒐⒋嵵緔 on 2019-12-24 07:28:11

Question


I am using urllib.urlopen with Python 2.7, and I need to process the downloaded HTML document, including the newlines it contains (within a <pre> element).

The urllib docs indicate that urlopen will not use universal newlines. How can I read the document with universal newlines?


Answer 1:


Unless the HTML file is already on your disk, urlopen() will correctly handle all newline formats (\n, \r\n and \r) in the HTML file you want to parse (that is, it will convert them to \n). The caveat in the urllib docs applies only to local files:

"If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines)"

For example:

>>> from urllib import urlopen
>>> urlopen("http://****.com/win_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
>>> urlopen("http://****.com/unix_new_lines.htm").read()   
'line 1\nline 2\n\n\nline 3'
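If the string returned by read() turns out to still contain \r\n or \r, one option is to decode the downloaded bytes through a text layer that has universal newlines switched on. The following is only a sketch: the helper name decode_with_universal_newlines, the UTF-8 default, and the sample bytes are assumptions for illustration, and the bytes literal stands in for whatever urlopen(...).read() returned.

```python
import io


def decode_with_universal_newlines(raw_bytes, encoding="utf-8"):
    """Decode raw_bytes to text, translating \\r\\n and \\r to \\n.

    Hypothetical helper, not part of urllib. The encoding is an
    assumption; use the charset the server actually declares.
    """
    # newline=None enables universal newlines mode on the text wrapper,
    # so every \r\n and lone \r comes back as \n when reading.
    wrapper = io.TextIOWrapper(
        io.BytesIO(raw_bytes), encoding=encoding, newline=None
    )
    return wrapper.read()


# Stand-in for the bytes of a downloaded page with mixed line endings.
downloaded = b"line 1\r\nline 2\r\rline 3"
print(repr(decode_with_universal_newlines(downloaded)))
```

Wrapping the bytes in io.BytesIO first avoids relying on the response object itself supporting the interface that io.TextIOWrapper expects.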



Answer 2:


When you process the contents of the <pre> tags, use splitlines to normalize the line endings:

'\n'.join(contents.splitlines())
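One detail of the one-liner above is that splitlines() discards a final line terminator, so joining the pieces loses a trailing newline. A minimal sketch that restores it is shown below. The function name normalize_newlines and the sample string are illustrative assumptions. Note also that on text (unicode) strings splitlines() splits on a few extra boundaries such as \v and \f, which may or may not matter for <pre> content.

```python
def normalize_newlines(contents):
    # Hypothetical helper: rewrite \r\n and \r line endings as \n.
    normalized = "\n".join(contents.splitlines())
    # splitlines() drops a final line terminator, so put one back
    # if the input ended with a line break.
    if contents.endswith(("\r", "\n")):
        normalized += "\n"
    return normalized


# Sample text mixing Windows, old Mac, and Unix line endings.
sample = "line 1\r\nline 2\rline 3\n"
print(repr(normalize_newlines(sample)))
```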


Source: https://stackoverflow.com/questions/8221296/how-can-i-download-and-read-a-url-with-universal-newlines
