Constructing a regular expression for URLs in the start_urls list in the Scrapy framework (Python)

迷失自我 2021-01-15 22:20

I am very new to Scrapy, and I haven't used regular expressions before.

The following is my spider.py code:

class ExampleSpider(BaseSp

2 answers

走了就别回头了 (original poster) 2021-01-15 22:42

If you are using CrawlSpider, it's usually not a good idea to override the parse method, because CrawlSpider uses parse internally to apply its rules.

A Rule object can separate the URLs you are interested in from the ones you do not care about.

See CrawlSpider in the docs for reference.

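# Note: these import paths come from old Scrapy releases. Newer releases
# provide CrawlSpider and Rule in scrapy.spiders and LinkExtractor in
# scrapy.linkextractors.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

class ExampleSpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/bookstore']

    rules = (
        # Extract only links whose URL contains /new/<digit>? and pass each
        # downloaded page to parse_bookstore.
        Rule(SgmlLinkExtractor(allow=(r'/new/[0-9]\?',)), callback='parse_bookstore'),
    )

    def parse_bookstore(self, response):
        hxs = HtmlXPathSelector(response)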
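
To see which links the allow pattern in the rule above would keep, it can help to try the regular expression on its own with Python's re module. The following is only a minimal sketch: the candidate URLs are invented for illustration, and it assumes the link extractor searches for the pattern anywhere in the URL rather than requiring a full match.

```python
import re

# Same pattern as the `allow` argument in the rule above.
ALLOW_PATTERN = re.compile(r'/new/[0-9]\?')

# Hypothetical URLs, invented for illustration only.
candidate_urls = [
    'http://www.example.com/bookstore/new/1?filter=bookstore',
    'http://www.example.com/bookstore/new/2?filter=bookstore',
    'http://www.example.com/bookstore/new/12?filter=bookstore',
    'http://www.example.com/bookstore/old/1?filter=bookstore',
    'http://www.example.com/bookstore/new/3',
]


def filter_urls(urls, pattern=ALLOW_PATTERN):
    """Return the URLs in which the pattern occurs anywhere (search, not full match)."""
    return [url for url in urls if pattern.search(url)]


matched = filter_urls(candidate_urls)
for url in matched:
    print(url)
```

Only the first two URLs are printed. The pattern expects exactly one digit followed by a literal question mark, so /new/12? and /new/3 are skipped. If page numbers can have more than one digit, [0-9]+ in place of [0-9] would accept them.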