Scrapy - parse a page to extract items - then follow and store item url contents

不思量自难忘° 2021-01-30 04:42

I have a question about how to do this in Scrapy. I have a spider that crawls listing pages of items. Every time a listing page with items is found, there's the parse_

2 answers
  • 2021-01-30 04:59

    After some testing and thinking, I found a solution that works for me. The idea is to use only the first rule, the one that gives you the listing pages of items, and, very importantly, to add follow=True to that rule.

    In parse_item() you yield a request instead of an item, but only after you have loaded the item. The request goes to the item's detail URL, and you pass the loaded item along to that request's callback through request.meta. The callback does its work with the response, and that is where you finally yield the item.

    So the end of parse_item() will look like this:

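    itemloaded = l.load_item()
    
    # Fill in the URL contents: request the item's detail page and
    # carry the loaded item along in the request meta.
    # (Very old Scrapy versions spell .xpath() as .select().)
    url = sel.xpath(item_url_xpath).extract()[0]
    request = Request(url, callback=self.parse_url_contents)
    request.meta['item'] = itemloaded
    
    yield request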
    

    And then parse_url_contents() will look like this:

    def parse_url_contents(self, response):
        item = response.request.meta['item']
        item['url_contents'] = response.body
        yield item
    

    If anyone has another (better) approach, let us know.

    Stefan

  • 2021-01-30 05:23

    I'm sitting with exactly the same problem, and from the fact that no-one has answered your question for 2 days I take it that the only solution is to follow that URL manually, from within your parse_item function.

    I'm new to Scrapy, so I wouldn't attempt it that way (although I'm sure it's possible). My solution will be to use urllib and BeautifulSoup to load the second page manually, extract that information myself, and save it as part of the Item. Yes, it is much more trouble than normal parsing in Scrapy, but it should get the job done with the least hassle.
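
A rough sketch of that manual approach, assuming Python 3 and the `bs4` package, could look like this. The helper names `fetch_detail` and `extract_detail_text` are hypothetical, and the extraction step (taking all visible text) is only a stand-in for whatever information is actually needed from the second page.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_detail(url, timeout=10):
    """Download the detail page and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read()


def extract_detail_text(html):
    """Return the visible text of the page, with pieces joined by single spaces."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.get_text(" ", strip=True)


# Inside parse_item() this could be used roughly as:
#     html = fetch_detail(detail_url)
#     item['url_contents'] = extract_detail_text(html)
```

One caveat with this design: `urlopen` blocks, so calling it inside a Scrapy callback stalls the crawler's event loop while each detail page downloads.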
