Why urllib.urlopen.read() does not correspond to source code?

问题

I'm trying to fetch the following webpage:

import urllib
urllib.urlopen("http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1").read()

The result does not correspond to what I see when inspecting the source code of the webpage using Google Chrome for example.

Could you tell me why this happens and how I could improve my code to overcome the problem?

Thank you for your help.

回答1:

What you are getting from urlopen is the raw webpage meaning no javascript is executed css is not used; where as what you get from Chrome (or other browsers) is final webpage which included executable javascript (which might alter the HTML), css rendering etc. all of which does not happen in urlopen...

Hence the difference, hope this is clear

回答2:

you can use python Selenium to solved your issue. Here is a example code have a look.

from selenium import webdriverr
url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
browser = webdriver.Firefox()
browser.get(url)
sleep(10)
all_body_id_html =  browser.find_element_by_id('body') # you can also get all html

Then due your rest of work according to your choice some more example with browser instance

def login(user='ssdf', password="cisin123"):
content = browser.find_element_by_id('content')
content.find_element_by_xpath('.//tbody/tr[2]//input[contains(@class,"textbox")]').send_keys(user)
content.find_element_by_xpath('.//tbody/tr[3]//input[contains(@class,"textbox")]').send_keys(password)
content.find_element_by_css_selector(".button").click()

回答3:

You can use Selenium with Firefox for solving the issue, but it may not be suitable in many cases as the browser pops up every-time you run the code. Another idea is to use a headless broswer like PhantomJS.

The best way for this is to use the mechanize library. Install mechanize via pip.

pip install mechanize

Then you can use the following code:

import mechanize 

mb = mechanize.Browser()
mb.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
mb.set_handle_robots(False)
url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
response = mb.open(url).read()
print response

It also provides option for sleep and executing scripts. You can read them in the documentation.

回答4:

Also, some websites have a so called browser switch which might lead to different source being shown when using different browsers (e.g. show a light version for mobile browsers).

Have a look at http://www.diveintopython.net/http_web_services/user_agent.html on how to change the User-Agent to something like "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1" (which is actually my User-Agent).

回答5:

It sounds like you want a library that can act like a browser and run the javascript for you, then give you the resulting source code. Windmill should be able to do this for you. (http://www.getwindmill.com/)

There is a good article on how to use it for what you want here:
http://www.packtpub.com/article/web-scraping-with-python

来源：https://stackoverflow.com/questions/12466900/why-urllib-urlopen-read-does-not-correspond-to-source-code

标签

python

urllib

urlopen