Remove newline in python with urllib

名媛妹妹 2021-01-16 06:27

I am using Python 3.x. While using urllib.request to download the webpage, I am getting a lot of \\n in between. I am trying to remove it using the

3 answers
  •  小鲜肉 (OP)
    2021-01-16 07:05

    Your download_page() function corrupts the HTML by calling str() on it, which is why you see \n (two characters, \ and n) in the output. Don't use .replace() or a similar workaround; fix the download_page() function instead:

    from urllib.request import urlopen
    
    with urlopen("http://www.zseries.in") as response:
        html_content = response.read()
    

    At this point html_content contains a bytes object. To get it as text, you need to know its character encoding, e.g., from the Content-Type HTTP header:

    encoding = response.headers.get_content_charset('utf-8')
    html_text = html_content.decode(encoding)
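
    Putting the two steps together, a corrected download_page() might look roughly like the sketch below. The original function body is not shown in the question, so the signature and the fallback_encoding parameter are assumptions made for illustration:

```python
from urllib.request import urlopen


def download_page(url, fallback_encoding="utf-8"):
    """Return the page at `url` as text (str). Hypothetical rewrite."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, exactly as sent by the server
        # Charset from the Content-Type header; the fallback is used
        # only when the header does not declare one.
        encoding = response.headers.get_content_charset(fallback_encoding)
    # Decode instead of calling str(): str(raw) would give "b'...\\n...'".
    return raw.decode(encoding)
```

    It would be called as html_text = download_page("http://www.zseries.in"); the result is a str with real newline characters, so no .replace() is needed.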
    

    See A good way to get the charset/encoding of an HTTP response in Python.

    If the server doesn't pass a charset in the Content-Type header, then there are complex rules for figuring out the character encoding of an HTML5 document, e.g., it may be specified inside the HTML document itself, such as in a <meta charset="utf-8"> element (you would need an HTML parser to get it).
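
    As a rough illustration of the "inside the document" case, the sketch below looks for a charset declared in a <meta> tag using the standard library's html.parser. It is a simplification, not the full HTML5 encoding-sniffing algorithm; the names MetaCharsetParser and sniff_meta_charset, and the 1024-byte window, are choices made here for illustration:

```python
from html.parser import HTMLParser


class MetaCharsetParser(HTMLParser):
    """Records the first charset declared in a <meta> tag (simplified)."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attrs = dict(attrs)
        if attrs.get("charset"):
            # Form 1: <meta charset="...">
            self.charset = attrs["charset"].strip()
        elif (attrs.get("http-equiv") or "").lower() == "content-type":
            # Form 2: <meta http-equiv="Content-Type"
            #               content="text/html; charset=...">
            for part in (attrs.get("content") or "").split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'")


def sniff_meta_charset(raw, default="utf-8"):
    """Guess the encoding of `raw` (bytes) from a <meta> declaration."""
    # The declaration is ASCII-compatible and expected near the top,
    # so only the first bytes are decoded, loosely.
    head = raw[:1024].decode("ascii", errors="replace")
    parser = MetaCharsetParser()
    parser.feed(head)
    return parser.charset or default
```

    It would only be consulted when response.headers.get_content_charset() returns None, e.g. html_text = html_content.decode(sniff_meta_charset(html_content)).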

    If you read the HTML correctly, you shouldn't see the literal characters \n in the page.
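
    To see the difference concretely, here is a small offline comparison. The bytes literal is a made-up stand-in for what response.read() returns:

```python
# Hypothetical page body, standing in for response.read().
raw = b"<html>\n<body>hi</body>\n</html>"

# str() on bytes returns the repr of the object, so each newline
# becomes the two characters backslash and n.
corrupted = str(raw)

# decode() converts the bytes to text and keeps real newlines.
decoded = raw.decode("utf-8")

print(corrupted)
print(decoded)
```

    The first print shows a single line starting with b' and containing \n as text; the second shows the markup spread over three lines.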
