Why urllib.urlopen.read() does not correspond to source code?

前端未结

关注

 5  649

I\'m trying to fetch the following webpage:

import urllib
urllib.urlopen(\"http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]


                      
              相关标签:


      
      
        
          5条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  你的背包        
                
              
                            
                2021-01-11 14:33
              
            
            
                                                                       
What you are getting from urlopen is the raw webpage meaning no javascript is executed css is not used; where as what you get from Chrome (or other browsers) is final webpage which included executable javascript (which might alter the HTML), css rendering etc. all of which does not happen in urlopen... 

Hence the difference, hope this is clear
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  抹茶落季        
                
              
                            
                2021-01-11 14:38
              
            
            
                                                                       
Also, some websites have a so called browser switch which might lead to different source being shown when using different browsers (e.g. show a light version for mobile browsers). 

Have a look at http://www.diveintopython.net/http_web_services/user_agent.html on how to change the User-Agent to something like "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1" (which is actually my User-Agent).
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  无人及你        
                
              
                            
                2021-01-11 14:39
              
            
            
                                                                       
you can use python Selenium to solved your issue. Here is a example code have a look.

from selenium import webdriverr
url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
browser = webdriver.Firefox()
browser.get(url)
sleep(10)
all_body_id_html =  browser.find_element_by_id('body') # you can also get all html


Then due your rest of work according to your choice
some more example with browser instance

def login(user='ssdf', password="cisin123"):
content = browser.find_element_by_id('content')
content.find_element_by_xpath('.//tbody/tr[2]//input[contains(@class,"textbox")]').send_keys(user)
content.find_element_by_xpath('.//tbody/tr[3]//input[contains(@class,"textbox")]').send_keys(password)
content.find_element_by_css_selector(".button").click()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  慢半拍i        
                
              
                            
                2021-01-11 14:42
              
            
            
                                                                       
You can use Selenium with Firefox for solving the issue, but it may not be suitable in many cases as the browser pops up every-time you run the code. Another idea is to use a headless broswer like PhantomJS.

The best way for this is to use the mechanize library. Install mechanize via pip.

pip install mechanize


Then you can use the following code:

import mechanize 

mb = mechanize.Browser()
mb.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')] 
mb.set_handle_robots(False)
url = "http://www.gallimard-jeunesse.fr/searchjeunesse/advanced/(order)/author?catalog[0]=1&SearchAction=1"
response = mb.open(url).read()
print response


It also provides option for sleep and executing scripts. You can read them in the documentation.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  后悔当初        
                
              
                            
                2021-01-11 14:51
              
            
            
                                                                       
It sounds like you want a library that can act like a browser and run the javascript for you, then give you the resulting source code. Windmill should be able to do this for you. (http://www.getwindmill.com/)

There is a good article on how to use it for what you want here:

http://www.packtpub.com/article/web-scraping-with-python
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复