I am using Python 3.x. When I download a web page with urllib.request, I get a lot of \n sequences in the output. I have tried to remove them using the methods given in other threads on the forum, including the strip() and replace() functions, but with no luck. I am running this code in Eclipse. Here is my code:
import urllib.request

# Downloading entire Web Document
def download_page(a):
    opener = urllib.request.FancyURLopener({})
    try:
        open_url = opener.open(a)
        page = str(open_url.read())
        return page
    except:
        return ""

raw_html = download_page("http://www.zseries.in")
print("Raw HTML = " + raw_html)

# Remove line breaks
raw_html2 = raw_html.replace('\n', '')
print("Raw HTML2 = " + raw_html2)
I cannot work out why there are so many \n sequences in the raw_html variable.
It seems they are literal \n characters (a backslash followed by n), so I suggest you do it like this:
raw_html2 = raw_html.replace('\\n', '')
Your download_page() function corrupts the HTML: the str() call returns the printable representation of the bytes object instead of decoding it, which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix the download_page() function instead:
from urllib.request import urlopen

with urlopen("http://www.zseries.in") as response:
    html_content = response.read()
At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., by reading it from the Content-Type HTTP header:
encoding = response.headers.get_content_charset('utf-8')
html_text = html_content.decode(encoding)
See "A good way to get the charset/encoding of an HTTP response in Python".
If the server doesn't pass a charset in the Content-Type header, there are complex rules for figuring out the character encoding of an HTML5 document; e.g., it may be specified inside the HTML document itself: <meta charset="utf-8"> (you would need an HTML parser to get it).
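As a rough illustration of the <meta charset> case, here is a minimal sketch built on the standard library's html.parser. The names MetaCharsetParser and sniff_meta_charset are hypothetical, and scanning only the first 1024 bytes is a simplification chosen for this example. It does not implement the full HTML5 rules (no byte order mark check, no http-equiv form), so treat it as a starting point only.

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Hypothetical helper: remembers the first <meta charset="..."> value seen."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first declaration counts; ignore everything else.
        if tag != "meta" or self.charset is not None:
            return
        value = dict(attrs).get("charset")
        if value:
            self.charset = value.strip()


def sniff_meta_charset(html_bytes, default="utf-8"):
    # Assumption: the declaration sits near the top of the document, so
    # only a short prefix is decoded (lossily) just to scan the markup.
    head = html_bytes[:1024].decode("ascii", errors="replace")
    parser = MetaCharsetParser()
    parser.feed(head)
    return parser.charset or default


page = b'<html><head><meta charset="iso-8859-1"></head><body></body></html>'
print(sniff_meta_charset(page))
```

The returned name can then be passed to html_content.decode() when the Content-Type header carries no charset.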
If you read the HTML correctly, you shouldn't see literal \n characters in the page.
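The difference is easy to reproduce without any network access. The snippet below is a small sketch using a made-up bytes value (the variable names are illustrative only); it compares calling str() on bytes with decoding them.

```python
# A made-up response body containing two real newline bytes.
raw = b"<html>\n<body>hello</body>\n</html>"

# str() on a bytes object returns its printable representation:
# it starts with b' and shows each newline as the two characters \ and n.
as_repr = str(raw)

# decode() turns the bytes into text that contains real newline characters.
as_text = raw.decode("utf-8")

print(as_repr)
print(as_text)

# On properly decoded text, the original replace('\n', '') works as intended.
print(as_text.replace("\n", ""))
```

This is why replace('\n', '') had no effect in the question: the string produced by str() contains no real newline characters to remove.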
If you look at the source you've downloaded, the \n escape sequences you're trying to replace() are actually escaped themselves: \\n. Try this instead:
import urllib.request

def download_page(a):
    opener = urllib.request.FancyURLopener({})
    open_url = opener.open(a)
    page = str(open_url.read()).replace('\\n', '')
    return page
I removed the try/except clause because generic except statements that don't target a specific exception (or class of exceptions) are generally bad. If it fails, you have no idea why.
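If you do want to keep the error handling, one option is to catch a specific exception class. The sketch below assumes urllib.error.URLError is the failure you care about (HTTPError is a subclass of it). Printing the reason and returning an empty string is just one possible policy, kept here to match the original function. Unlike the code above, this sketch decodes the bytes instead of calling str(), and decoding errors are deliberately left uncaught.

```python
from urllib.error import URLError
from urllib.request import urlopen


def download_page(url, default_encoding="utf-8"):
    """Return the page at url as text, or "" if it cannot be fetched."""
    try:
        with urlopen(url) as response:
            raw = response.read()
            # Fall back to default_encoding when the header has no charset.
            encoding = response.headers.get_content_charset(default_encoding)
    except URLError as exc:
        # Report why the download failed instead of hiding it.
        print("Could not download {}: {}".format(url, exc.reason))
        return ""
    return raw.decode(encoding)
```

For example, download_page("http://www.zseries.in") would return the decoded page text, with real newlines instead of literal \n sequences.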
Source: https://stackoverflow.com/questions/27674076/remove-newline-in-python-with-urllib