Question
I was using urllib.urlopen with Python 2.7, but I need to process the downloaded HTML document and the newlines it contains (within a <pre> element).
The urllib docs indicate that urlopen will not use universal newlines. How can I do this?
Answer 1:
Unless the HTML file is already on your disk, urlopen() will correctly handle all newline formats (\n, \r\n and \r) in the HTML file you want to parse, that is, it will convert them to \n, according to the urllib docs:
"If the URL does not have a scheme identifier, or if it has file: as its scheme identifier, this opens a local file (without universal newlines)"
E.g.
>>> from urllib import urlopen
>>> urlopen("http://****.com/win_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
>>> urlopen("http://****.com/unix_new_lines.htm").read()
'line 1\nline 2\n\n\nline 3'
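Whether a particular response really arrives with only \n can be checked directly on the string returned by read(). The following is a minimal sketch, assuming the body is available as a byte string; the helper name count_newline_styles is hypothetical and not part of urllib:

```python
import re

def count_newline_styles(data):
    # Count each newline convention in a byte string.
    # A CR followed by LF is counted once as "crlf", not as a separate CR or LF.
    crlf = len(re.findall(b"\r\n", data))
    cr = len(re.findall(b"\r(?!\n)", data))   # CR not followed by LF
    lf = len(re.findall(b"(?<!\r)\n", data))  # LF not preceded by CR
    return {"crlf": crlf, "cr": cr, "lf": lf}

# Example input standing in for the result of urlopen(...).read()
counts = count_newline_styles(b"line 1\r\nline 2\rline 3\n")
```

If "crlf" or "cr" is non-zero for your page, the body still contains non-Unix line endings and needs to be normalized before processing.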
Answer 2:
When you process the contents of the pre tags, use splitlines to normalize the line endings:
'\n'.join(contents.splitlines())
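As a small illustration of that one-liner, here is a sketch wrapped in a function. The name normalize_newlines and the sample string are assumptions made for the example. It assumes contents is a text string; on Python 3 the bytes returned by read() would need to be decoded first.

```python
def normalize_newlines(contents):
    # splitlines() breaks the text at \n, \r\n and \r (among other line
    # boundaries) and drops the terminators; joining with "\n" leaves a
    # single convention.
    return "\n".join(contents.splitlines())

# Sample text mixing Windows, old Mac and Unix line endings
sample = "line 1\r\nline 2\rline 3\n"
normalized = normalize_newlines(sample)
```

Two things to keep in mind: a trailing newline at the end of the text is dropped, and on unicode strings splitlines() also splits on a few less common separators such as form feed and U+2028, which may or may not be what you want inside a pre element.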
Source: https://stackoverflow.com/questions/8221296/how-can-i-download-and-read-a-url-with-universal-newlines