Scrapy: how to populate hierarchical items with multiple requests

Posted by 江枫思渺然 on 2019-12-08 12:23:27

Question


This is a follow-up to "Multiple nested request with scrapy". I am asking because the solution presented there has flaws:
1. It eliminates asynchrony, which heavily reduces scraping efficiency.
2. If an exception occurs while processing the "stack" of links, no item is yielded at all.
3. What if there is a huge number of child items?

To deal with (1) I considered this:

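import threading

from scrapy.loader import ItemLoader


class CatLoader(ItemLoader):

    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super().__init__(item=item, selector=selector, response=response, parent=parent, **context)
        # Guards the count of child requests that have not finished yet.
        self.lock = threading.Lock()
        self.counter = 0

    def dec_counter(self):
        # One child request has finished, so count down under the lock.
        with self.lock:
            self.counter -= 1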

Then in parser:

    if not urls:
        self.logger.warning('Cat without items, url: %s', response.url)
        yield cl.load_item()
        return
    # One pending child request per URL; the last one to finish emits the item.
    cl.counter = len(urls)
    for url in urls:
        yield Request(url, callback=self.parse_item, meta={'loader': cl})

And in parse_item() I can do:

def parse_item(self, response):
    l = response.meta['loader']

    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()

BUT! To deal with (2) I need to do this in each function:

def parse_item(self, response):
    try:
        l = response.meta['loader']

    finally:
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()
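
One way the repeated try/finally might be factored out is a decorator around each child callback. The following is only a minimal sketch under stated assumptions, not a tested Scrapy recipe: FakeLoader and FakeResponse are hypothetical stand-ins for CatLoader and Scrapy's response so the example runs without Scrapy, finishes_loader is a made-up name, and the sketch assumes a failed child should be logged and skipped rather than re-raised (which differs from the try/finally version above).

```python
import functools
import logging

logger = logging.getLogger(__name__)


class FakeLoader:
    """Hypothetical stand-in for CatLoader: a pending counter plus collected values."""

    def __init__(self, pending):
        self.counter = pending
        self.values = []

    def dec_counter(self):
        self.counter -= 1

    def load_item(self):
        return {"children": list(self.values)}


class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; only meta is modelled."""

    def __init__(self, **meta):
        self.meta = meta


def finishes_loader(callback):
    """Decorator: always count the child as finished, even if the callback fails."""

    @functools.wraps(callback)
    def wrapper(self, response):
        loader = response.meta["loader"]
        try:
            result = callback(self, response, loader)
            if result is not None:
                # Pass through anything the callback itself yields.
                yield from result
        except Exception:
            # Assumption: a failed child is logged and skipped, not fatal.
            logger.exception("child callback failed")
        loader.dec_counter()
        if loader.counter == 0:
            # Last pending child: emit the parent item.
            yield loader.load_item()

    return wrapper


class DemoSpider:
    @finishes_loader
    def parse_item(self, response, loader):
        if response.meta.get("fail"):
            raise ValueError("simulated parse error")
        loader.values.append(response.meta["name"])


def run_demo():
    """Feed two children (one failing) through the callback; return emitted items."""
    loader = FakeLoader(pending=2)
    spider = DemoSpider()
    emitted = []
    emitted += list(spider.parse_item(FakeResponse(loader=loader, fail=True)))
    emitted += list(spider.parse_item(FakeResponse(loader=loader, name="a")))
    return emitted
```

Note that this only covers exceptions raised inside the callback. A request that fails to download never reaches the callback at all, so in a real spider the counter would presumably also have to be decremented from the request's errback.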

I do not consider this an elegant solution. Could anyone suggest a better one? Also, I intend to insert items into a DB rather than output JSON, so maybe it would be better to create the item with a promise and write a pipeline that parses the children to check whether the promise is fulfilled (when the item has been inserted into the DB), or something like that?

UPD: Hierarchical items: category -> article -> images. All are to be saved in different tables with proper relations. So: 1) Articles must be inserted into their table AFTER the category. 2) An article must know the ID of its category to form the relation. The same goes for image records.
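
To make the ordering requirement concrete, here is a rough sketch of the insert order using the standard-library sqlite3 module. The table names, column names, and the nested-dict shape of the category are all assumptions made for illustration; the question does not specify a schema or a database.

```python
import sqlite3


def create_schema(conn):
    """Create three hypothetical tables linked by foreign keys."""
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE article (
            id INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL REFERENCES category(id),
            title TEXT
        );
        CREATE TABLE image (
            id INTEGER PRIMARY KEY,
            article_id INTEGER NOT NULL REFERENCES article(id),
            url TEXT
        );
        """
    )


def save_category(conn, category):
    """Insert parent rows before their children, passing each new ID down."""
    cur = conn.execute(
        "INSERT INTO category (name) VALUES (?)", (category["name"],)
    )
    category_id = cur.lastrowid  # known only after the category row exists
    for article in category["articles"]:
        cur = conn.execute(
            "INSERT INTO article (category_id, title) VALUES (?, ?)",
            (category_id, article["title"]),
        )
        article_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO image (article_id, url) VALUES (?, ?)",
            [(article_id, url) for url in article["images"]],
        )
    conn.commit()
    return category_id


def demo():
    """Save one made-up category and read the hierarchy back through joins."""
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    save_category(
        conn,
        {"name": "news", "articles": [{"title": "t1", "images": ["u1", "u2"]}]},
    )
    rows = conn.execute(
        "SELECT c.name, a.title, i.url FROM image i "
        "JOIN article a ON a.id = i.article_id "
        "JOIN category c ON c.id = a.category_id ORDER BY i.id"
    ).fetchall()
    conn.close()
    return rows
```

This sketch saves the whole tree in one call, which assumes the complete category is available at once. With a huge number of children (flaw 3 above), one might instead insert the category first and attach its ID to each child request.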

Source: https://stackoverflow.com/questions/46383499/scrapy-how-to-populate-hierarchic-items-with-multipel-requests
