Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

I have the following code that is partially working,

class ThreadSpider(CrawlSpider):
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    star


                      
3 answers

粉色の甜心, 2021-02-09 11:23

I've had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider: it would visit only the start URL and then close. The only way I managed to get it to work was to drop the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to every link Scrapy collects. I then made a few other adjustments to compensate for the issues this approach introduces. Here's what I did:

You'll need these two functions to build the Splash URL, and to take it apart again if you intend to store the original URL somewhere.

from urllib.parse import urlencode, parse_qs


With the Splash URL prepended to every link, Scrapy will filter them all out as off-site requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']


However, this poses a problem because then we may end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor rules. By only scraping links from our desired domain, we get around the offsite request problem.

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
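
One caveat worth checking: LinkExtractor applies `allow` patterns as a regex search, and every group before the domain in the pattern above is optional, so it can also match URLs that merely contain the domain text. The sketch below uses only the standard library to compare that pattern with a hostname-based check. `is_on_site` is a hypothetical helper (not part of Scrapy) and the sample URLs are made up. LinkExtractor's `allow_domains` argument is another way to express the same restriction.

```python
import re
from urllib.parse import urlparse

# The pattern from the rule above, used here with a plain regex search.
ALLOW = re.compile(r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com'))


def is_on_site(url, domain='example.com'):
    """Hypothetical helper: True if the host is the domain or one of its subdomains."""
    host = urlparse(url).hostname or ''
    return host == domain or host.endswith('.' + domain)


# Made-up sample URLs: one on-site, two that only contain the domain text.
urls = [
    'https://www.example.com/thread/1',
    'https://other.org/?ref=example.com',
    'https://notexample.com/',
]

regex_hits = [bool(ALLOW.search(u)) for u in urls]
host_hits = [is_on_site(u) for u in urls]

for u, r, h in zip(urls, regex_hits, host_hits):
    print(u, 'regex:', r, 'hostname:', h)
```

If the looser match is a concern, anchoring the pattern or filtering on the hostname keeps the crawl on one site.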


Here's the process_links method. The dictionary passed to urlencode is where you put all of your Splash arguments.

def process_links(self, links):
    for link in links:
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url':link.url,
                                                                          'wait':2.0})
    return links
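
To see what this does to a single link, here is a minimal standalone sketch. `FakeLink` is a hypothetical stand-in for Scrapy's link objects (only the `url` attribute matters here), and `SPLASH_PREFIX` is just a name given to the string used above; the sample URL is made up.

```python
from dataclasses import dataclass
from urllib.parse import urlencode

# Same prefix as in the method above, given a name for readability.
SPLASH_PREFIX = "http://localhost:8050/render.html?&"


@dataclass
class FakeLink:
    """Hypothetical stand-in for a Scrapy link; only `url` is needed."""
    url: str


def process_links(links):
    for link in links:
        # Skip links that were already rewritten to go through Splash.
        if SPLASH_PREFIX not in link.url:
            link.url = SPLASH_PREFIX + urlencode({'url': link.url, 'wait': 2.0})
    return links


links = process_links([FakeLink("https://www.example.com/page?id=1")])
wrapped = links[0].url
# Running it a second time leaves the URL unchanged.
again = process_links(links)[0].url
print(wrapped)
```

Note that urlencode percent-encodes the original URL, so its own query string cannot be confused with the Splash arguments.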


Finally, to take the original URL back out of the Splash URL, use the parse_qs function.

parse_qs(response.url)['url'][0] 


One final note about this approach: you'll notice that I have an '&' right at the beginning of the query string in the Splash URL (...render.html?&). This keeps parsing the Splash URL consistent, so the 'url' key can be extracted no matter what order the arguments are in when you call urlencode.
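
The effect of that leading '&' can be checked with the standard library alone. The sketch below also shows an alternative that does not depend on it: isolate the query string with `urlparse` first, then call `parse_qs`. `original_url` is a hypothetical helper name, and the sample URL is made up.

```python
from urllib.parse import parse_qs, urlencode, urlparse

query = urlencode({'url': 'https://www.example.com/', 'wait': 2.0})
with_amp = "http://localhost:8050/render.html?&" + query
without_amp = "http://localhost:8050/render.html?" + query

# parse_qs splits on '&'. Without the extra '&', the first key swallows
# everything up to the first '=', so there is no plain 'url' key.
keys_with = sorted(parse_qs(with_amp))
keys_without = sorted(parse_qs(without_amp))


def original_url(splash_url):
    """Hypothetical helper: recover the target URL from a Splash render.html URL."""
    return parse_qs(urlparse(splash_url).query)['url'][0]


print(keys_with)
print(keys_without)
print(original_url(without_amp))
```

Parsing only the query component works with or without the extra '&', at the cost of one more function call.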
                                                                        
                                                        
            
            
              
                

北恋, 2021-02-09 11:37

Use the code below (just copy and paste):

restrict_xpaths=('//a[contains(text(), "Next Page")]')


Instead of

restrict_xpaths=("//a[contains(text(), 'Next Page')]")

                                                                        
                                                        
            
            
              
                

既然无缘, 2021-02-09 11:41

Seems to be related to https://github.com/scrapy-plugins/scrapy-splash/issues/92 

Personally, I use dont_process_response=True so that the response is an HtmlResponse (which is required by the code in _requests_to_follow).

I also redefine the _build_request method in my spider, like so:

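from scrapy_splash import SplashRequest  # at the top of the spider module

def _build_request(self, rule, link):
    # Route each extracted link through Splash while keeping the rule metadata CrawlSpider expects.
    r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r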


In the GitHub issue, some users just redefine the _requests_to_follow method in their class.
                                                                        
                                                        
            
            
              
                