Question
I want to print a web page's source code, but Python's print command just prints empty space, and I think it's because of the page's large size. Is there any way to print the page source in the shell, or at least to a file? I tried writing it to a file, but this error occurred:
UnicodeEncodeError: 'charmap' codec can't encode character '\u06cc' in position 11826: character maps to <undefined>
How can I fix it?
import urllib.request
response = urllib.request.urlopen('http://www.farsnews.com')
html = response.read()  # raw bytes
print(html)  # prints empty space!
hf = open('test.txt', 'w')
a = str(html, 'utf-8')
hf.write(a)  # this write raises the UnicodeEncodeError above
hf.close()
Python easily prints a[0:1000], but for a[0:len(a)] it prints, as I said, empty space!
Answer 1:
I've just tried the same on Windows 7 using Python 3.2.5, and here's what I got:
Python 3.2.5 (default, May 15 2013, 23:07:10) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import request
>>> r = request.urlopen("http://www.farsnews.com")
>>> bytecode = r.read()
>>> htmlstr = bytecode.decode()
>>> print(bytecode)
Printing bytecode works well because it prints the escaped representation of the non-ASCII characters, but printing htmlstr raises a UnicodeEncodeError on Windows, because some characters cannot be encoded with the current locale's default encoding (Windows' cmd.exe does not use Unicode by default). In my case the encoding was 'cp866', as I saw in the traceback.
By default, decode() uses the 'utf-8' encoding, and if you want to override it you should explicitly specify the encoding to use for decoding.
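The failure can be reproduced without a network connection. This is a minimal sketch, assuming a console that uses cp866 as in the traceback mentioned here; can_encode is a hypothetical helper name, not part of any library, and '\u06cc' is the character from the error message in the question.

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# ARABIC LETTER FARSI YEH, the character named in the question's error message
sample = "\u06cc"

print(can_encode(sample, "cp866"))   # False: cp866 is a Cyrillic DOS code page
print(can_encode(sample, "cp1252"))  # False: a Western European Windows code page
print(can_encode(sample, "utf-8"))   # True: UTF-8 can encode any Unicode character
```

print() has to perform the same encoding step for the console, which is why it fails there while slicing or inspecting the string in memory works.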
So here's a possible workaround:
>>> safe_str = bytecode.decode(encoding='cp866', errors='ignore')
>>> print(safe_str)
Actually, it's equivalent to
>>> safe_str = str(bytecode, encoding='cp866', errors='ignore')
>>> print(safe_str)
The errors parameter controls what happens when the encoding you're trying to use cannot decode a particular byte sequence: 'strict' (the default) raises an error, while 'ignore' silently skips the offending bytes.
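One caveat: decoding UTF-8 bytes as cp866 avoids the exception, but non-ASCII text tends to come out as unrelated characters. A possible alternative, sketched here under the assumption that the page really is UTF-8, is to decode with 'utf-8' first and only then replace what the console cannot show. The sample bytes stand in for r.read(), and printable is a hypothetical helper name.

```python
# Stand-in for the bytes returned by r.read(); assumed to be UTF-8
data = "abc \u06cc".encode("utf-8")

# Decode-side error handlers, applied when bytes are invalid for the codec
dropped = data.decode("ascii", errors="ignore")    # undecodable bytes are skipped
replaced = data.decode("ascii", errors="replace")  # each bad byte becomes U+FFFD
print(dropped)
print(ascii(replaced))  # ascii() keeps the output printable on any console

def printable(text, console_encoding):
    """Swap characters the console encoding lacks for '?', keep the rest."""
    return text.encode(console_encoding, errors="replace").decode(console_encoding)

text = data.decode("utf-8")
print(printable(text, "cp866"))  # abc ?
```

The difference from the workaround in the answer is the direction: the replacement happens when encoding for the console, so every character the console does support is printed correctly.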
Answer 2:
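I simply did:
import requests
url = 'http://www.farsnews.com'  # the page from the question
page = requests.get(url)
print(page.text.encode('utf8'))  # bytes print as escapes, so the console encoding doesn't matter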
If you're scraping websites with Python, then this is an awesome starting point. I also recommend that you look into BeautifulSoup (a library for parsing HTML).
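For the file route the question asked about, passing an explicit encoding to open() is likely enough to avoid the 'charmap' error. This is a minimal sketch, assuming the page is UTF-8; save_html is a hypothetical helper name, and the sample bytes stand in for the downloaded page so the example runs offline.

```python
import os
import tempfile

def save_html(raw_bytes, path, source_encoding="utf-8"):
    """Decode downloaded bytes and write them as UTF-8 text, regardless of locale."""
    text = raw_bytes.decode(source_encoding)
    # Without encoding=..., open() falls back to the locale's encoding,
    # which is what produces the 'charmap' error on some Windows setups.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return text

# Stand-in for response.read(); includes the character from the traceback
raw = "<html><body>\u06cc</body></html>".encode("utf-8")
path = os.path.join(tempfile.gettempdir(), "test.txt")
saved = save_html(raw, path)

with open(path, encoding="utf-8") as f:
    print(ascii(f.read()))  # ascii() keeps the output printable on any console
```

If the text itself isn't needed in Python, opening the file in 'wb' mode and writing the raw bytes unchanged is another option.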
Source: https://stackoverflow.com/questions/20299088/print-web-page-source-code-in-python