How to select and extract text between two elements?

伪装坚强ぢ 2021-01-22 06:42

I am trying to scrape this website using Scrapy. The page structure looks like this:


                      
2 answers
  • 2021-01-22 06:57

    You can try the XPath expressions below to fetch:

    • all text nodes for "Follows" block:

      //div[./preceding-sibling::h4[1]="Follows"]//text()
      
    • all text nodes for "Followed by" block:

      //div[./preceding-sibling::h4[1]="Followed by"]//text()
      
    • all text nodes for "Spin-off" block:

      //div[./preceding-sibling::h4[1]="Spin-off"]//text()
      
  • 2021-01-22 07:03

    An extraction pattern I like to use for these cases is:

    • loop over the "boundaries" (here, h4 elements)
    • while enumerating them starting from 1
    • using XPath's sibling axes (following-sibling here; @Andersson's answer uses preceding-sibling) to get the elements before the next boundary,
    • and filtering them by counting the number of preceding "boundary" elements, since we know from our enumeration where we are

    This would be the loop:

    $ scrapy shell 'http://www.imdb.com/title/tt0092455/trivia?tab=mc&ref_=tt_trv_cnn'
    (...)
    >>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
    ...     print(cnt, h4.xpath('normalize-space()').get())
    ... 
    1 Follows 
    2 Followed by 
    3 Edited into 
    4 Spun-off from 
    5 Spin-off 
    6 Referenced in 
    7 Featured in 
    8 Spoofed in 
    

    And this is one example of using the enumeration to get the elements between boundaries (note that this uses an XPath variable, $cnt, in the expression and passes cnt=cnt to .xpath()):

    >>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
    ...     print(cnt, h4.xpath('normalize-space()').get())
    ...     print(h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
    ...                    cnt=cnt).xpath(
    ...                       'string(.//a)').getall())
    ... 
    1 Follows 
    ['Star Trek', 'Star Trek: The Animated Series', 'Star Trek: The Motion Picture', 'Star Trek II: The Wrath of Khan', 'Star Trek III: The Search for Spock', 'Star Trek IV: The Voyage Home']
    2 Followed by 
    ['Star Trek V: The Final Frontier', 'Star Trek VI: The Undiscovered Country', 'Star Trek: Deep Space Nine', 'Star Trek: Generations', 'Star Trek: Voyager', 'First Contact', 'Star Trek: Insurrection', 'Star Trek: Enterprise', 'Star Trek: Nemesis', 'Star Trek', 'Star Trek Into Darkness', 'Star Trek Beyond', 'Star Trek: Discovery', 'Untitled Star Trek Sequel']
    3 Edited into 
    ['Reading Rainbow: The Bionic Bunny Show', 'The Unauthorized Hagiography of Vincent Price']
    4 Spun-off from 
    ['Star Trek']
    5 Spin-off 
    ['Star Trek: The Next Generation - The Transinium Challenge', 'A Night with Troi', 'Star Trek: Deep Space Nine', "Star Trek: The Next Generation - Future's Past", 'Star Trek: The Next Generation - A Final Unity', 'Star Trek: The Next Generation: Interactive VCR Board Game - A Klingon Challenge', 'Star Trek: Borg', 'Star Trek: Klingon', 'Star Trek: The Experience - The Klingon Encounter']
    6 Referenced in 
    (...)
    

    Here's how you could use that to populate an item (here, I'm using a simple dict just for illustration):

    >>> item = {}
    >>> for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
    ...     key = h4.xpath('normalize-space()').get().strip() # there are some non-breaking spaces
    ...     if key in ['Follows', 'Followed by', 'Spin-off']:
    ...         values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]',
    ...                        cnt=cnt).xpath(
    ...                           'string(.//a)').getall()
    ...         item[key] = values
    ... 
    
    >>> from pprint import pprint
    >>> pprint(item)
    {'Followed by': ['Star Trek V: The Final Frontier',
                     'Star Trek VI: The Undiscovered Country',
                     'Star Trek: Deep Space Nine',
                     'Star Trek: Generations',
                     'Star Trek: Voyager',
                     'First Contact',
                     'Star Trek: Insurrection',
                     'Star Trek: Enterprise',
                     'Star Trek: Nemesis',
                     'Star Trek',
                     'Star Trek Into Darkness',
                     'Star Trek Beyond',
                     'Star Trek: Discovery',
                     'Untitled Star Trek Sequel'],
     'Follows': ['Star Trek',
                 'Star Trek: The Animated Series',
                 'Star Trek: The Motion Picture',
                 'Star Trek II: The Wrath of Khan',
                 'Star Trek III: The Search for Spock',
                 'Star Trek IV: The Voyage Home'],
     'Spin-off': ['Star Trek: The Next Generation - The Transinium Challenge',
                  'A Night with Troi',
                  'Star Trek: Deep Space Nine',
                  "Star Trek: The Next Generation - Future's Past",
                  'Star Trek: The Next Generation - A Final Unity',
                  'Star Trek: The Next Generation: Interactive VCR Board Game - A '
                  'Klingon Challenge',
                  'Star Trek: Borg',
                  'Star Trek: Klingon',
                  'Star Trek: The Experience - The Klingon Encounter']}
    >>> 
    