Remove newline in Python with urllib

前端未结


I am using Python 3.x. When I download a web page with urllib.request, I get a lot of \\n sequences in the output. I am trying to remove them using the

3 Answers

我在风中等你

2021-01-16 06:48
These seem to be literal \n characters (a backslash followed by n), so I suggest doing this:
```
raw_html2 = raw_html.replace('\\n', '')
```
花落未央

2021-01-16 07:00
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
```
import urllib.request

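def download_page(a):
    # Note: FancyURLopener has been deprecated since Python 3.3.
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    # str() on bytes returns its repr (b'...'), so each newline shows up as
    # the two characters \ and n; replace('\\n', '') strips exactly those.
    page = str(open_url.read()).replace('\\n', '')
    return page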
```
I removed the try/except clause because generic except statements without targeting a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
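
If you do want error handling, here is a minimal sketch of what targeting specific exceptions could look like. It assumes urlopen() rather than the opener above; the timeout parameter and the choice to return None on failure are my own assumptions, not part of the original function.

```python
import urllib.error
import urllib.request


def download_page(url, timeout=10):
    """Return the raw response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (404, 500, ...).
        print(f"HTTP error {exc.code} for {url}: {exc.reason}")
    except urllib.error.URLError as exc:
        # No usable answer at all (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {exc.reason}")
    return None
```

HTTPError is a subclass of URLError, so it has to be caught first. Anything else (a bug in your own code, for example) still propagates with a full traceback.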
小鲜肉

2021-01-16 07:05
Your download_page() function corrupts the HTML with its str() call, which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix download_page() instead:
```
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
```
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., from the Content-Type HTTP header:
```
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
```
See "A good way to get the charset/encoding of an HTTP response in Python".

If the server doesn't pass a charset in the Content-Type header, there are complex rules for determining the character encoding of an HTML5 document. For example, it may be specified inside the document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
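
As a rough illustration of the <meta> case, here is a sketch using the standard library's html.parser. It is a simplification, not the full HTML5 encoding-sniffing algorithm. The names MetaCharsetFinder and sniff_meta_charset are hypothetical, and looking only at the first 1024 bytes is an assumption.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # <meta charset="utf-8">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # <meta http-equiv="Content-Type" content="text/html; charset=...">
            content = attrs.get("content") or ""
            for part in content.split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip("\"' ")


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Guess the encoding from a <meta> tag near the start of the document."""
    finder = MetaCharsetFinder()
    # Charset names are ASCII, so a lenient ASCII decode is enough to find them.
    finder.feed(html_bytes[:1024].decode("ascii", errors="replace"))
    return finder.charset or default
```

You could then call html_content.decode(sniff_meta_charset(html_content)) when the Content-Type header carries no charset.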

If you read the html correctly then you shouldn't see literal characters \n in the page.
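
To see the difference without hitting the network, here is a small sketch that uses a hard-coded bytes value (an assumption standing in for response.read()) and compares str() with .decode():

```python
# Stand-in for what response.read() returns: raw bytes with real newlines.
raw = b"<html>\n<body>Hello</body>\n</html>\n"

# str() on bytes gives the repr, i.e. the text "b'<html>\\n...'".
as_repr = str(raw)

# decode() turns the bytes into text and keeps real newline characters.
as_text = raw.decode("utf-8")

print("\\n" in as_repr)     # True: backslash + n, two literal characters
print("\\n" in as_text)     # False: no literal backslash-n here
print(as_text.count("\n"))  # 3 real newlines
```

The str() result also carries the leading b' and trailing ' of the repr, which is another sign the page was never decoded.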