Why does urllib.urlopen().read() not match the page source?

说谎 2021-01-11 14:02

I'm trying to fetch the following webpage:

import urllib
urllib.urlopen("http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]


        
5 answers
  • 2021-01-11 14:33

    What you get from urlopen is the raw page exactly as the server sent it: no JavaScript is executed and no CSS is applied. What you see in Chrome (or any other browser) is the final page, after the browser has run the page's JavaScript (which may alter the HTML) and applied CSS rendering. None of that happens in urlopen.

    Hence the difference. Hope this is clear.
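One rough way to check whether this explanation applies is to look for `<script>` tags in the raw HTML, since those are what a browser would run and urlopen would not. The following is a minimal sketch in Python 3 (the question itself uses Python 2's urllib); the names `ScriptCounter` and `count_scripts` and the sample markup are hypothetical, not part of any library.

```python
from html.parser import HTMLParser


class ScriptCounter(HTMLParser):
    """Counts <script> start tags in raw HTML (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names before calling this method.
        if tag == "script":
            self.scripts += 1


def count_scripts(raw_html):
    """Return how many <script> tags the raw, unrendered HTML contains."""
    parser = ScriptCounter()
    parser.feed(raw_html)
    parser.close()
    return parser.scripts


# Made-up sample standing in for what urlopen(...).read() would return.
sample = (
    "<html><body><div id='results'></div>"
    "<script>document.getElementById('results').innerHTML = 'filled in';</script>"
    "</body></html>"
)
print(count_scripts(sample))  # prints 1
```

In this sample the `results` div is empty in the raw HTML and would only be filled in once a browser ran the script, which is the kind of mismatch described above.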

  • 2021-01-11 14:38

    Also, some websites use a so-called browser switch, which can serve different source to different browsers (e.g. a light version for mobile browsers).

    Have a look at http://www.diveintopython.net/http_web_services/user_agent.html on how to change the User-Agent to something like "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1" (which is actually my User-Agent).
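As a rough illustration of setting the User-Agent, here is a minimal sketch using Python 3's `urllib.request` (the question uses Python 2's urllib, so this is an assumption about the reader's environment). The helper names `build_request` and `fetch` are hypothetical, and `http://example.com/` is only a placeholder URL.

```python
import urllib.request

# The User-Agent string quoted in the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
    "(KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1"
)


def build_request(url, user_agent=BROWSER_UA):
    """Build a request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, user_agent=BROWSER_UA):
    """Fetch the raw bytes of url while sending a browser-like User-Agent."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=30) as resp:
        return resp.read()


# Building the request does not touch the network; only fetch() does.
req = build_request("http://example.com/")
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

This only changes what the server is told about the client. It does not run any JavaScript, so it helps only when the server itself varies the page by browser.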

  • 2021-01-11 14:39

    You can use Selenium for Python to solve your issue. Here is some example code:

    from time import sleep

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
    browser = webdriver.Firefox()
    browser.get(url)
    sleep(10)  # crude wait so the page's JavaScript can finish
    # Element with id "body"; browser.page_source gives the whole rendered HTML
    all_body_id_html = browser.find_element(By.ID, 'body')
    

    Then do the rest of your work as you see fit. Here is another example using the browser instance:

    def login(user='ssdf', password="cisin123"):
        # Assumes `browser` from above and `from selenium.webdriver.common.by import By`
        content = browser.find_element(By.ID, 'content')
        content.find_element(By.XPATH, './/tbody/tr[2]//input[contains(@class,"textbox")]').send_keys(user)
        content.find_element(By.XPATH, './/tbody/tr[3]//input[contains(@class,"textbox")]').send_keys(password)
        content.find_element(By.CSS_SELECTOR, ".button").click()
    
  • 2021-01-11 14:42

    You can use Selenium with Firefox to solve the issue, but it may not be suitable in many cases because a browser window pops up every time you run the code. Another idea is to use a headless browser like PhantomJS.
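For the headless idea, one possibility is to run Firefox itself without a window rather than switching to PhantomJS. This is a minimal sketch, assuming a recent Selenium for Python with Firefox and its driver available on the machine; the function name `fetch_rendered_html` and the 10-second default wait are hypothetical choices, not anything taken from the answers here.

```python
def fetch_rendered_html(url, wait_seconds=10):
    """Load url in headless Firefox and return the HTML after scripts have run."""
    # Imported inside the function so this module still loads where
    # Selenium is not installed.
    import time

    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    options = Options()
    options.add_argument("-headless")  # run Firefox without opening a window
    browser = webdriver.Firefox(options=options)
    try:
        browser.get(url)
        # Crude fixed wait for the page's JavaScript; an assumption, tune as needed.
        time.sleep(wait_seconds)
        return browser.page_source
    finally:
        browser.quit()  # always shut the browser down, even on errors
```

Calling `fetch_rendered_html(url)` with the URL from the question should then return the page as modified by its scripts, without a window appearing.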

    The best way to do this is to use the mechanize library. Install mechanize via pip:

    pip install mechanize
    

    Then you can use the following code:

    import mechanize

    mb = mechanize.Browser()
    # Present a browser-like User-Agent instead of the library default
    mb.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    mb.set_handle_robots(False)  # do not fetch or obey robots.txt
    url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
    response = mb.open(url).read()
    print(response)
    

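    Note that mechanize does not execute JavaScript, so this helps only when the server varies the page by request headers such as User-Agent. You can read about its other options in the documentation.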

  • 2021-01-11 14:51

    It sounds like you want a library that can act like a browser and run the JavaScript for you, then give you the resulting source code. Windmill should be able to do this for you (http://www.getwindmill.com/).

    There is a good article on how to use it for what you want here:
    http://www.packtpub.com/article/web-scraping-with-python
