Question
I am using Python 3.x. When I download a web page with urllib.request, I get a lot of \n sequences in the output. I have tried to remove them using the methods suggested in other threads on this forum, without success. I have tried both the strip() function and the replace() function, but no luck. I am running this code in Eclipse. Here is my code:
import urllib.request

# Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

# Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I cannot work out why there are so many \n sequences in the raw_html variable.
Answer 1:
It seems they are literal \n characters (a backslash followed by the letter n) rather than real newlines, so I suggest doing this:
raw_html2 = raw_html.replace('\\n', '')
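To see where the literal characters come from, here is a small sketch that needs no network access. The sample bytes are made up for illustration; they stand in for whatever read() returns.

```python
# Made-up bytes standing in for the result of open_url.read().
raw = b"<html>\n<body>Hello</body>\n</html>"

# str() on a bytes object returns its repr, so every newline byte
# turns into two characters: a backslash and the letter n.
as_repr = str(raw)
print(as_repr)  # b'<html>\n<body>Hello</body>\n</html>'

# There is no real newline in as_repr, so this replace() changes nothing.
unchanged = as_repr.replace("\n", "")

# Replacing the two-character sequence works, but the b'...' wrapper stays.
stripped = as_repr.replace("\\n", "")
print(stripped)  # b'<html><body>Hello</body></html>'
```

Note that the result still carries the b'...' wrapper from the repr, so this only hides the symptom.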
Answer 2:
Your download_page() function corrupts the HTML (through the str() call), which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix the download_page() function instead:
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., by reading it from the Content-Type HTTP header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See "A good way to get the charset/encoding of an HTTP response in Python".
If the server doesn't pass a charset in the Content-Type header, there are complex rules for working out the character encoding of an HTML5 document; e.g., it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
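As a rough illustration of the parser approach, here is a sketch that looks for a <meta charset="..."> tag with the standard library's html.parser. It is a deliberate simplification, not the full HTML5 sniffing algorithm: it ignores the older <meta http-equiv="Content-Type" ...> form and byte order marks. The names MetaCharsetParser and sniff_meta_charset are hypothetical, and the 1024-byte limit mirrors the prescan window used by the HTML5 rules.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the first charset attribute found on a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values are left as-is.
        if tag == "meta" and self.charset is None:
            value = dict(attrs).get("charset")
            if value:
                self.charset = value


def sniff_meta_charset(body, limit=1024):
    """Return the charset declared in the first `limit` bytes, or None."""
    parser = MetaCharsetParser()
    # The declaration itself is ASCII, so a lossy ASCII decode is enough here.
    parser.feed(body[:limit].decode("ascii", errors="replace"))
    return parser.charset
```

For example, sniff_meta_charset(b'<head><meta charset="utf-8"></head>') returns "utf-8", and the function returns None when no such tag is present.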
If you read the HTML correctly, you shouldn't see literal \n characters in the page.
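Putting the steps above together, a corrected download_page() might look like the sketch below. The decode_body helper, the UTF-8 fallback, and errors="replace" are my assumptions rather than part of the original answer; adjust them to suit the pages you fetch.

```python
from urllib.request import urlopen


def decode_body(body, charset=None, fallback="utf-8"):
    """Decode response bytes, using `fallback` if no charset was declared."""
    # errors="replace" is an assumption: undecodable bytes are substituted
    # instead of raising UnicodeDecodeError.
    return body.decode(charset or fallback, errors="replace")


def download_page(url):
    """Fetch `url` and return its body as text (performs a network request)."""
    with urlopen(url) as response:
        body = response.read()
        # Returns None when the Content-Type header carries no charset.
        charset = response.headers.get_content_charset()
    return decode_body(body, charset)
```

With this version, raw_html.replace('\n', '') from the question removes real newlines as expected, because the string no longer contains an escaped repr.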
Answer 3:
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because a generic except statement that doesn't target a specific exception (or class of exceptions) is generally bad practice. If it fails, you have no idea why.
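If you do want to keep error handling, one option is to catch only the exceptions that urllib documents. This is a sketch, not the only reasonable design: it uses urlopen rather than FancyURLopener (deprecated since Python 3.3), and printing the error and returning an empty string are choices made here to match the question's original behaviour.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_page(url):
    """Return the page text, or "" if the request fails."""
    try:
        with urlopen(url) as response:
            charset = response.headers.get_content_charset("utf-8")
            return response.read().decode(charset)
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"Server returned HTTP {err.code} for {url}")
    except URLError as err:
        # Covers DNS failures, refused connections, missing local files, etc.
        print(f"Could not reach {url}: {err.reason}")
    return ""
```

Unlike a bare except, this reports why the download failed and lets unrelated bugs (a typo, a KeyboardInterrupt) surface normally.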
Source: https://stackoverflow.com/questions/27674076/remove-newline-in-python-with-urllib