Remove newlines in Python with urllib

I am using Python 3.x. When I download a webpage with urllib.request, I get a lot of \n sequences in the output. I have tried to remove them using the methods suggested in other threads on this forum, including strip() and replace(), but with no luck. I am running this code in Eclipse. Here is my code:

import urllib.request

#Downloading entire Web Document 
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""
raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

#Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)

I cannot work out why there are so many \n sequences in the raw_html variable.

These look like literal \n characters (a backslash followed by n) rather than real newlines, so I suggest escaping the backslash in the search string:

raw_html2 = raw_html.replace('\\n', '')

jfs

Your download_page() function corrupts the HTML: calling str() on the bytes object returned by read() gives you its repr (b'...'), which is why you see \n as two characters, \ and n, in the output. Don't use .replace() or a similar workaround; fix download_page() instead:

from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()

At this point html_content contains a bytes object. To get it as text, you need to know its character encoding. For example, you can read it from the Content-Type HTTP header, falling back to UTF-8 if none is given:

encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)

See "A good way to get the charset/encoding of an HTTP response in Python".

If the server doesn't pass a charset in the Content-Type header, there are complex rules for working out the character encoding of an HTML5 document. For example, it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
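As a rough illustration of the <meta charset> case only, here is a minimal sketch using the standard library's html.parser. The names MetaCharsetParser and sniff_meta_charset are hypothetical, and looking only at the first 1024 bytes is a simplifying assumption; this does not implement the full HTML5 encoding-detection rules.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the value of the first <meta charset="..."> tag it sees."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Ignore everything except the first <meta> tag carrying a charset.
        if tag != "meta" or self.charset is not None:
            return
        value = dict(attrs).get("charset")
        if value:
            self.charset = value.strip().lower()


def sniff_meta_charset(html_bytes, default="utf-8"):
    """Return the charset declared in a <meta> tag, or `default` if none."""
    parser = MetaCharsetParser()
    # latin-1 maps every byte to a character, so this decode cannot fail;
    # it is used only to locate the ASCII-spelled tag, not as the real text.
    parser.feed(html_bytes[:1024].decode("latin-1"))
    return parser.charset or default
```

The result could then be passed to html_content.decode() when the Content-Type header carries no charset.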

If you read the HTML correctly, you shouldn't see literal \n characters in the page.
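The difference can be seen without any network access. In this small sketch, the sample bytes are made up for illustration: str() produces the repr of the bytes object, with a b'...' wrapper and backslash escapes, while decode() produces the actual text with real newlines.

```python
raw = b"<p>one</p>\n<p>two</p>\n"  # stand-in for what response.read() returns

as_repr = str(raw)             # repr of the bytes object, not the page text
as_text = raw.decode("utf-8")  # the page text, with real newline characters

print(as_repr)   # newlines appear as the two characters \ and n
print(as_text)   # newlines are actual line breaks

# With properly decoded text, the original replace('\n', '') works as intended.
one_line = as_text.replace("\n", "")
```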

If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:

import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page

I removed the try/except clause because bare except clauses that don't target a specific exception (or class of exceptions) are generally bad practice. If the call fails, you have no idea why.
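If you do want error handling, one option is to catch the specific exceptions from urllib.error. The following is only a sketch: it combines that idea with the decode() approach from the other answer, and the choice to print a message and return None, the 10-second timeout, and the default_encoding parameter are all assumptions made for illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def download_page(url, default_encoding="utf-8"):
    """Return the page as text, or None if the download fails."""
    try:
        with urlopen(url, timeout=10) as response:
            body = response.read()
            encoding = response.headers.get_content_charset(default_encoding)
    except HTTPError as err:
        # The server answered, but with an error status such as 404.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code} for {url}: {err.reason}")
        return None
    except URLError as err:
        # The URL could not be opened at all (DNS failure, refused
        # connection, missing local file, and so on).
        print(f"Could not open {url}: {err.reason}")
        return None
    return body.decode(encoding, errors="replace")
```

Because each except clause names what it handles, the message tells you why the download failed, and any other unexpected exception still propagates instead of being silently swallowed.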

Source: https://stackoverflow.com/questions/27674076/remove-newline-in-python-with-urllib

Tags: python, python-3.x, urllib