How to crawl an entire website with Scrapy?

长发绾君心 2021-01-31 12:17

I'm unable to crawl a whole website; Scrapy only crawls the surface, and I want it to crawl deeper. I've been googling for the last 5-6 hours with no help. My code is below:



        
2 Answers
  • 2021-01-31 12:42

    Rules short-circuit: the first rule a link satisfies is the one that gets applied, so your second Rule (the one with the callback) will never be called.

    Change your rules to this:

    rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
    
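    The first-match-wins behavior can be sketched in plain Python (a simulation of the rule-matching idea, not Scrapy's actual implementation; the `Rule` and `dispatch` names here are made up for illustration): each link is handled by the first rule whose pattern matches it, so a catch-all first rule without a callback shadows every rule after it.

    ```python
    import re

    # Hypothetical stand-in for Scrapy's Rule: a link pattern plus an optional callback.
    class Rule:
        def __init__(self, pattern, callback=None):
            self.pattern = re.compile(pattern)
            self.callback = callback

    def dispatch(link, rules):
        """Return the callback of the FIRST rule whose pattern matches the link."""
        for rule in rules:
            if rule.pattern.search(link):
                return rule.callback
        return None

    # A catch-all rule first: the second rule (with callback) is never reached.
    bad = [Rule(r"."), Rule(r"/tutorials", callback="parse_item")]
    print(dispatch("/tutorials?page=2", bad))   # None - the callback never fires

    # A single rule that both follows links and has a callback is the fix.
    good = [Rule(r".", callback="parse_item")]
    print(dispatch("/tutorials?page=2", good))  # parse_item
    ```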
  • 2021-01-31 12:51

    When parsing the start_urls, deeper URLs can be extracted from the href attributes of anchor tags, and deeper requests can then be yielded from parse(). Here is a simple example; the most important code is shown below:

    from scrapy.spiders import Spider
    from scrapy.http import Request
    from tutsplus.items import TutsplusItem
    import re

    class MySpider(Spider):
        name            = "tutsplus"
        allowed_domains = ["code.tutsplus.com"]
        start_urls      = ["http://code.tutsplus.com/"]

        # Links crawled so far; a class-level set so it is shared
        # across parse() calls (a local list would be recreated on
        # every response and never deduplicate anything)
        crawled_links = set()

        # Pattern for the links we want to follow
        # (only tutorial listing pages)
        link_pattern = re.compile(r"^/tutorials\?page=\d+")

        def parse(self, response):
            links = response.xpath('//a/@href').extract()

            for link in links:
                # If it is a proper link and has not been seen yet, follow it
                if self.link_pattern.match(link) and link not in self.crawled_links:
                    self.crawled_links.add(link)
                    yield Request(response.urljoin(link), callback=self.parse)

            titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
            for title in titles:
                item = TutsplusItem()
                item["title"] = title
                yield item
    