Why does text retrieved from pages sometimes look like gibberish?

你。 posted on 2019-12-12 17:42:05

Question


I'm using urllib and urllib2 in Python to open and read web pages, but sometimes the text I get back is unreadable. For example, if I run this:

import urllib

text = urllib.urlopen('http://tagger.steve.museum/steve/object/141913').read()
print text

I get some unreadable text. I've read these posts:

Gibberish from urlopen

Does python urllib2 automatically uncompress gzip data fetched from webpage?

but can't seem to find my answer.

Thank you in advance for your help!


UPDATE: I fixed the problem by 'convincing' the server that my user-agent is a browser and not a crawler.

import urllib

class NewOpener(urllib.FancyURLopener):
  version = 'Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2'

nop = NewOpener()
html_text = nop.open('http://tagger.steve.museum/steve/object/141913').read()

Thank you all for your replies.
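
For readers on Python 3: the workaround in the update uses Python 2's urllib.FancyURLopener, which is deprecated in Python 3's urllib.request. Below is a minimal sketch of the same idea, assuming the server only looks at the User-Agent header to decide what to send. The names BROWSER_USER_AGENT, build_request and fetch are illustrative and do not come from the original post.

```python
import urllib.request

# Same browser-like identification string as in the update above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux i686) AppleWebKit/535.2 (KHTML, like Gecko) "
    "Ubuntu/11.10 Chromium/15.0.874.120 Chrome/15.0.874.120 Safari/535.2"
)


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Build a request that identifies itself with a browser-like User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch a page and decode it with the charset the server declares.

    Falls back to UTF-8 (an assumption) when no charset is declared.
    """
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Calling fetch('http://tagger.steve.museum/steve/object/141913') would then play the role of nop.open(...).read() above. Whether that server still responds the same way is not something this sketch can confirm.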


Answer 1:


You can use Selenium to get the content. Download the server and the client drivers, start the server, and then run this:

from selenium import selenium
s = selenium("localhost", 4444, "*chrome", "http://tagger.steve.museum")
s.start()

s.open("/steve/object/141913")

text = s.get_html_source()
print text



Answer 2:


This gibberish is the server's real response to the request for 'http://tagger.steve.museum/steve/object/141913'. It looks like obfuscated JavaScript which, when executed by a browser, loads the page content.

To get this content, you need to execute this JavaScript, and this can be a really difficult task within Python. If you still want to do this, take a look at pywebkitgtk.
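
Before reaching for a browser engine, it can help to check which kind of "gibberish" came back: a gzip-compressed body (the case covered by the posts linked in the question) or a script-only page like the one described here. The following is a rough sketch, written for Python 3. The helper names decode_body and looks_like_script_only are hypothetical, and the script check is a crude heuristic, not a reliable detector.

```python
import gzip

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def decode_body(raw, charset="utf-8"):
    """Decompress the body if it is gzip data, then decode it to text.

    The UTF-8 default is an assumption; pass the real charset if known.
    """
    if raw.startswith(GZIP_MAGIC):
        raw = gzip.decompress(raw)
    return raw.decode(charset, errors="replace")


def looks_like_script_only(text):
    """Crude heuristic: a <script> tag is present but no <body> tag is."""
    lowered = text.lower()
    return "<script" in lowered and "<body" not in lowered
```

If decode_body turns the response into readable HTML, compression was the problem. If looks_like_script_only returns True for the decoded text, the content is probably being built by JavaScript, and only something that executes scripts will produce the final page.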



Source: https://stackoverflow.com/questions/8271484/why-does-text-retrieved-from-pages-sometimes-look-like-gibberish
