Print web page source code in python

Question

I want to print the source code of a web page, but Python's print command just prints empty space, and I think that's because of the page's large size. Is there any way to print the page source in the shell, or at least write it to a file? I've tried writing it to a file, but this error occurred:

UnicodeEncodeError: 'charmap' codec can't encode character '\u06cc' in position 11826: character maps to <undefined>

How can I fix it?

import urllib.request
response = urllib.request.urlopen('http://www.farsnews.com')
html = response.read()

print(html)#prints empty space! 

hf=open('test.txt','w')
a=str(html,'utf-8')
hf.write(a)
hf.close()

Python easily prints a[0:1000], but for a[0:len(a)] it prints, as I said, empty space!

Answer 1:

I've just tried the same on Win7 using python 3.2.5 and here's what I got:

Python 3.2.5 (default, May 15 2013, 23:07:10) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import request
>>> r = request.urlopen("http://www.farsnews.com")
>>> bytecode = r.read()
>>> htmlstr = bytecode.decode()
>>> print(bytecode)

Printing bytecode works because print shows the bytes object with escape sequences for the non-ASCII values, but printing htmlstr raises a UnicodeEncodeError on Windows because some characters cannot be encoded in the console's default encoding (Windows' cmd.exe does not use a Unicode code page by default).
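To see the failure in isolation, here is a minimal sketch that needs no network access. It checks whether the character from the question's traceback ('\u06cc', ARABIC LETTER FARSI YEH) can be encoded by a given codec. The helper name can_encode is made up for this illustration, and 'cp866' is simply the console encoding reported in this answer; yours may differ.

```python
# The character named in the question's traceback.
ch = '\u06cc'

def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

utf8_ok = can_encode(ch, 'utf-8')    # UTF-8 can represent every code point
cp866_ok = can_encode(ch, 'cp866')   # a legacy console code page cannot
print(utf8_ok, cp866_ok)
```

print has to encode the string for the console in the same way, which is why the error appears only when the whole page, including such characters, is printed.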

In my case the console encoding was 'cp866', as shown in the traceback.

By default, bytes.decode() in Python 3 assumes the 'utf-8' encoding. If you want to override that, you should explicitly specify the encoding to use for decoding.
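As a hedged illustration of choosing the encoding explicitly, the sketch below picks the charset declared in the Content-Type header and falls back to UTF-8. The function name decode_body is hypothetical. An email.message.Message object stands in for response.headers here (urllib's header object derives from that class), so the example runs offline; the header value is an assumption, not what the site actually sends.

```python
from email.message import Message

def decode_body(raw, headers, fallback='utf-8'):
    """Decode raw bytes with the charset from Content-Type, else the fallback."""
    charset = headers.get_content_charset() or fallback
    return raw.decode(charset, errors='replace')

# Stand-in for response.headers; the declared charset is assumed for the demo.
headers = Message()
headers['Content-Type'] = 'text/html; charset=utf-8'

raw = 'a\u06ccb'.encode('utf-8')   # stand-in for response.read()
text = decode_body(raw, headers)
```

With a real response this would be called as decode_body(r.read(), r.headers).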

So here's a possible workaround:

>>> safe_str = bytecode.decode(encoding='cp866', errors='ignore')
>>> print(safe_str)

Actually, it's equivalent to

>>> safe_str = str(bytecode, encoding='cp866', errors='ignore')
>>> print(safe_str)

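The errors parameter controls whether an error is raised when the chosen encoding cannot decode a particular byte sequence; 'ignore' silently drops such bytes. Note that decoding UTF-8 bytes as 'cp866' makes the text printable but does not reproduce the original characters.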
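An alternative sketch, not part of the original answer: decode the page as UTF-8, write the file with an explicit encoding='utf-8' (which avoids the 'charmap' error from the question), and escape only what the console cannot show when printing. The helper names write_utf8 and console_safe, the sample string, and the temp-file path are all assumptions made for this example.

```python
import os
import sys
import tempfile

def write_utf8(path, text):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, 'w', encoding='utf-8') as f:
        f.write(text)

def console_safe(text, encoding=None):
    """Replace characters the console cannot encode with \\uXXXX escapes."""
    enc = encoding or sys.stdout.encoding or 'ascii'
    return text.encode(enc, errors='backslashreplace').decode(enc)

# Stand-in for the decoded page; contains the character from the traceback.
sample = 'a\u06ccb'

path = os.path.join(tempfile.gettempdir(), 'test_page_source.txt')
write_utf8(path, sample)

# 'cp866' is passed explicitly so the result is the same on any machine.
print(console_safe(sample, 'cp866'))  # prints a\u06ccb
```

In real use, console_safe(text) with no second argument falls back to sys.stdout.encoding, so characters the console supports are printed unchanged.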

Answer 2:

I simply did

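import requests

url = 'http://www.farsnews.com'  # the page from the question
page = requests.get(url)
# Encoding to UTF-8 yields bytes, which print() shows as an escaped b'...' literal
print(page.text.encode('utf8'))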

If you're scraping websites with Python, then this is an awesome starting point. I also recommend that you look into BeautifulSoup (a library for parsing HTML).

Source: https://stackoverflow.com/questions/20299088/print-web-page-source-code-in-python

Tags: python, string, python-3.x, urllib