Issue scraping with Beautiful Soup


I've scraped websites before using this same technique, but with this website it does not seem to work.

import urllib2
from BeautifulSoup import BeautifulSoup


                      
2 Answers
陌清茗, 2021-02-10 01:27

  but I want to know why I am getting a GIF when accessing the URL like that,
  while when I access it via my browser I get the website perfectly.


Because these guys are smart and don't want their website to be accessed from outside a web browser. What you need to do is imitate a known browser by adding a User-agent entry to the request headers. Here is a modified example that should work:

>>> import urllib2
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> url = "http://www.weatheronline.co.uk/weather/maps/current?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C"
>>> response = opener.open(url)
>>> page = response.read()
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(page)
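
The session above is Python 2 (urllib2 and BeautifulSoup 3). As a rough sketch of the same idea on Python 3, assuming only the standard library's urllib.request: the helper name build_browser_request is made up for illustration, and the actual download is left commented out because it needs network access and the site may respond differently today.

```python
import urllib.request


def build_browser_request(url, user_agent="Mozilla/5.0"):
    """Return a Request that carries a browser-like User-Agent header."""
    # Hypothetical helper, not part of urllib: it only wraps Request().
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


url = ("http://www.weatheronline.co.uk/weather/maps/current"
       "?LANG=en&DATE=1354104000&CONT=euro&LAND=UK&KEY=UK&SORT=1"
       "&INT=06&TYP=sonne&ART=tabelle&RUBRIK=akt&R=310&CEL=C")
request = build_browser_request(url)

# urllib stores header names capitalized as "User-agent".
print(request.get_header("User-agent"))

# To actually fetch and parse the page (needs network access, and bs4 for parsing):
# with urllib.request.urlopen(request) as response:
#     page = response.read()
# soup = BeautifulSoup(page, "html.parser")
```

If the header is set correctly, the request should look like it comes from a browser as far as the User-agent check is concerned; other checks (cookies, referrer) are not covered by this sketch.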

                                                                        
                                                        
            
            
              
                
长发绾君心, 2021-02-10 01:37
It means that the URL you are accessing returns a GIF image, not a web page. In fact, I ran the script and saved "page" to a file, and what you get is a 1x1 pixel white (or possibly transparent) GIF.
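
One way to confirm this without saving the file is to look at the first bytes of the response. This is only a sketch: the function name gif_dimensions is hypothetical, and the sample bytes are a hand-built GIF header standing in for the downloaded "page" data.

```python
import struct


def gif_dimensions(data):
    """Return (width, height) if data starts with a GIF header, else None."""
    if len(data) < 10 or data[:6] not in (b"GIF87a", b"GIF89a"):
        return None
    # Width and height are little-endian unsigned 16-bit integers at bytes 6-9.
    return struct.unpack("<HH", data[6:10])


# Hand-built header of a 1x1 GIF, used here in place of real response bytes.
sample = b"GIF89a" + struct.pack("<HH", 1, 1)

print(gif_dimensions(sample))           # (1, 1)
print(gif_dimensions(b"<html><body>"))  # None
```

Running gif_dimensions(page) on the real response would return a size tuple for the tracking-pixel case and None when actual HTML comes back.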

The reason you don't get that in an actual web browser is probably that they don't want you to scrape the site.

From their terms of use: 
"You may not copy, reproduce, republish, download, post, broadcast, transmit or otherwise use the Site's content in any way except for your own personal, non-commercial use. "

You could probably fake a web browser with some work, but I'd still recommend talking to WeatherOnline instead. They want you to pay for their data, but if you do, you will most likely get a proper API to use instead of screen scraping.
                                                                        
                                                        
            
            
              
                