Remove newline in Python with urllib

Submitted by 爷,独闯天下 on 2019-12-01 14:00:51

These look like literal \n characters (a backslash followed by n, not a real newline), so I suggest removing them like this:

raw_html2 = raw_html.replace('\\n', '')
jfs

Your download_page() function corrupts the HTML by calling str() on the bytes it read, which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix download_page() instead:

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()

At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., by reading it from the Content-Type HTTP header:

encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)

See "A good way to get the charset/encoding of an HTTP response in Python".

If the server doesn't pass a charset in the Content-Type header, then there are complex rules for working out the character encoding of an HTML5 document, e.g., it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
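As a rough illustration of looking inside the document for a declared charset, here is a minimal sketch using the standard library's html.parser. The names MetaCharsetParser and sniff_meta_charset are hypothetical helpers, not part of any library. The sketch only handles the simple <meta charset="..."> form within the first 1024 bytes; it is not the full HTML5 encoding-detection algorithm (it ignores byte order marks and the http-equiv form, for instance).

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Hypothetical helper: remembers the first <meta charset="..."> seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first <meta charset=...> is of interest.
        if tag != "meta" or self.charset is not None:
            return
        value = dict(attrs).get("charset")
        if value:
            self.charset = value.strip().lower()


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none."""
    # The tag itself is plain ASCII, so a lossy ASCII decode of the start
    # of the document is enough to find it.
    head = html_bytes[:1024].decode("ascii", errors="replace")
    parser = MetaCharsetParser()
    parser.feed(head)
    return parser.charset or default
```

With this in place, something like html_content.decode(sniff_meta_charset(html_content)) could serve as a fallback when the header carries no charset.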

If you read the HTML correctly, you shouldn't see literal \n characters in the page.
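To see where the literal \n comes from, compare str() with .decode() on a small bytes value. The sample bytes below are made up for illustration; they stand in for what response.read() returns.

```python
# Made-up sample standing in for the bytes returned by response.read().
raw = b"<p>one</p>\n<p>two</p>"

# str() on bytes gives the repr of the object: a b'...' wrapper with the
# newline shown as the two characters backslash and n.
as_repr = str(raw)

# .decode() turns the bytes into real text with a real newline character.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)
```

The first print shows everything on one line inside b'...', while the second prints two lines of HTML.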

If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:

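import urllib.request

def download_page(a):
    # Note: FancyURLopener is deprecated; urlopen() is the usual replacement.
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    # str() on bytes yields the repr, so newlines appear as backslash + n;
    # the result also keeps the b'...' wrapper around the page.
    page = str(open_url.read()).replace('\\n', '')
    return page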

I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
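If error handling is still wanted, one option is to catch only the exceptions urllib is documented to raise. This is a sketch rather than the original poster's code: the timeout value, the printed messages, and the choice to return None on failure are all assumptions, and it decodes the bytes instead of calling str() on them.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_page(url, timeout=10):
    """Return the page as text, or None if the request fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            raw = response.read()
            # Fall back to UTF-8 when the header carries no charset.
            encoding = response.headers.get_content_charset("utf-8")
    except HTTPError as err:
        # HTTPError subclasses URLError, so it has to be caught first.
        print(f"HTTP error {err.code} for {url}: {err.reason}")
        return None
    except URLError as err:
        print(f"Could not open {url}: {err.reason}")
        return None
    return raw.decode(encoding)
```

Because each handler names a specific exception, the failure reason is reported, and any unexpected error still surfaces with a full traceback.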
