Question
I'm using urllib and urllib2 in Python to open and read web pages, but sometimes the text I get back is unreadable. For example, if I run this:
import urllib
text = urllib.urlopen('http://tagger.steve.museum/steve/object/141913').read()
print text
I get some unreadable text. I've read these posts:
Gibberish from urlopen
Does python urllib2 automatically uncompress gzip data fetched from webpage?
but can't seem to find my answer.
Thank you in advance for your help!
UPDATE: I fixed the problem by 'convincing' the server that my user-agent is a browser and not a crawler.
import urllib
class NewOpener(urllib.FancyURLopener):
    version = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2'
nop = NewOpener()
html_text = nop.open('http://tagger.steve.museum/steve/object/141913').read()
Thank you all for your replies.
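In Python 3, the same idea can be sketched with urllib.request by sending a browser-like User-Agent header. This is a minimal sketch, not a definitive implementation: build_request and fetch are hypothetical helper names, the 10-second timeout is an arbitrary choice, and the User-Agent string is simply the one used in the update above.

```python
import urllib.request

# User-Agent string copied from the update above; any browser-like value may work.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) "
    "Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA, timeout=10):
    """Fetch a page and decode it with the charset the server declares.

    Falls back to UTF-8 (an assumption) when no charset is declared.
    """
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example call (performs a real network request):
# html_text = fetch("http://tagger.steve.museum/steve/object/141913")
```

Whether the server still treats this User-Agent as a browser is not something the sketch can guarantee.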
Answer 1:
You can use Selenium to get the content. Download the Selenium server and the Python client driver, start the server, and then run this:
from selenium import selenium
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()
s.open("/steve/object/141913")
text = s.get_html_source()
print text
Answer 2:
This gibberish is the actual server response for the request to 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript which, when executed by a browser, loads the page content.
To get this content, you need to execute this JavaScript, and this can be a really difficult task within Python. If you still want to do this, take a look at pywebkitgtk.
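Before reaching for a JavaScript engine, it may be worth ruling out the other cause raised in the posts linked in the question: a gzip-compressed body printed as if it were text. The following is a rough sketch under that assumption; maybe_gunzip and to_text are hypothetical helper names, and the check relies only on the two-byte gzip magic number rather than on the response headers.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(raw):
    """Decompress raw bytes if they look like a gzip stream, else return them unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


def to_text(raw, charset="utf-8"):
    """Turn a response body into text, undoing gzip compression first if present.

    The UTF-8 default is an assumption; pass the charset the server declares.
    """
    return maybe_gunzip(raw).decode(charset, errors="replace")
```

If the output is still unreadable after this step and looks like script source, the body is probably the obfuscated JavaScript described above, and decompression will not help.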
Source: https://stackoverflow.com/questions/8271484/why-does-text-retrieved-from-pages-sometimes-look-like-gibberish