How to select and extract text between two elements?


I am trying to scrape this website using Scrapy. The page structure looks like this:


      
      
        
2 answers

情话喂你, 2021-01-22 06:57

You can try the XPath expressions below to fetch:


all text nodes for the "Follows" block:

//div[./preceding-sibling::h4[1]="Follows"]//text()

all text nodes for the "Followed by" block:

//div[./preceding-sibling::h4[1]="Followed by"]//text()

all text nodes for the "Spin-off" block:

//div[./preceding-sibling::h4[1]="Spin-off"]//text()


                                                                        
                                                        
            
            
              
                
梦谈多话, 2021-01-22 07:03

An extraction pattern I like to use for these cases is:


loop over the "boundaries" (here, the h4 elements),
enumerating them starting from 1;
use XPath's following-sibling axis, similar to @Andersson's answer, to get the elements that come after each boundary;
and filter those elements by counting their preceding "boundary" siblings, since the enumeration tells us which boundary we are at.


This would be the loop:

$ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn'
(...)
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
... 
1 Follows 
2 Followed by 
3 Edited into 
4 Spun-off from 
5 Spin-off 
6 Referenced in 
7 Featured in 
8 Spoofed in 


And this is one example of using the enumeration to get the elements between boundaries (note that this uses an XPath variable, $cnt, in the expression, with its value passed as cnt=cnt to .xpath()):

>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     print(cnt, h4.xpath('normalize-space()').get())
...     print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                    cnt=cnt).xpath(
...                       'string(.//a)').getall())
... 
1 Follows 
['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home']
2 Followed by 
['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel']
3 Edited into 
['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price']
4 Spun-off from 
['Star Trek']
5 Spin-off 
['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter']
6 Referenced in 
(...)


Here's how you could use that to populate an item (here, I'm using a simple dict just for illustration):

>>> item = {}
>>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
...     key = h4.xpath('normalize-space()').get().strip() # there are some non-breaking spaces
...     if key in ['Follows', 'Followed by', 'Spin-off']:
...         values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
...                        cnt=cnt).xpath(
...                           'string(.//a)').getall()
...         item[key] = values
... 

>>> from pprint import pprint
>>> pprint(item)
{'Followed by': ['Star Trek V: The Final Frontier',
                 'Star Trek VI: The Undiscovered Country',
                 'Star Trek: Deep Space Nine',
                 'Star Trek: Generations',
                 'Star Trek: Voyager',
                 'First Contact',
                 'Star Trek: Insurrection',
                 'Star Trek: Enterprise',
                 'Star Trek: Nemesis',
                 'Star Trek',
                 'Star Trek Into Darkness',
                 'Star Trek Beyond',
                 'Star Trek: Discovery',
                 'Untitled Star Trek Sequel'],
 'Follows': ['Star Trek',
             'Star Trek: The Animated Series',
             'Star Trek: The Motion Picture',
             'Star Trek II: The Wrath of Khan',
             'Star Trek III: The Search for Spock',
             'Star Trek IV: The Voyage Home'],
 'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge',
              'A Night with Troi',
              'Star Trek: Deep Space Nine',
              "Star Trek: The Next Generation - Future's Past",
              'Star Trek: The Next Generation - A Final Unity',
              'Star Trek: The Next Generation: Interactive VCR Board Game - A '
              'Klingon Challenge',
              'Star Trek: Borg',
              'Star Trek: Klingon',
              'Star Trek: The Experience - The Klingon Encounter']}
>>> 

                                                                        
                                                        
            
            
              
                