Scrapy: how to populate hierarchic items with multiple requests

Question

This is an extension of "Multiple nested request with scrapy". I am asking because the solution presented there has flaws:
1. It eliminates asynchrony, which heavily reduces scraping efficiency.
2. If an exception occurs while the "stack" of links is being processed, no item will be yielded.
3. What if there is a huge number of child items?

To deal with (1) I considered this:

import threading

from scrapy.loader import ItemLoader


class CatLoader(ItemLoader):

    def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        super().__init__(item=item, selector=selector, response=response, parent=parent, **context)
        self.lock = threading.Lock()
        self.counter = 0  # number of child requests still pending

    def dec_counter(self):
        # Count one child request as finished.
        with self.lock:
            self.counter -= 1

Then in the parser:

    if not urls:
        self.logger.warning('Cat without items, url: ' + response.url)
        yield cl.load_item()
        return  # no child requests to wait for
    cl.counter = len(urls)  # one pending child request per URL
    for url in urls:
        rq = Request(url, self.parse_item)
        rq.meta['loader'] = cl  # share the loader with the child callback
        yield rq

And in parse_item() I can do:

def parse_item(self, response):
    l = response.meta['loader']

    l.dec_counter()
    if l.counter == 0:
        yield l.load_item()

However, to deal with (2) I need to do this in every callback:

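def parse_item(self, response):
    # Look the loader up before the try block so it is always bound in finally.
    l = response.meta['loader']
    try:
        pass  # parse the response and populate the loader here
    finally:
        # Runs even if parsing raises, so the parent item is still emitted.
        l.dec_counter()
        if l.counter == 0:
            yield l.load_item()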

I don't consider this an elegant solution. Could anyone suggest a better one? Also, I intend to insert the items into a DB rather than write JSON output, so maybe it would be better to create the item with a promise and write a pipeline that parses the children to check whether the promise is fulfilled (when the item has been inserted into the DB), or something like that?
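
One way to avoid repeating the try/finally in every callback might be a decorator. This is only a minimal sketch: the name finishes_parent is hypothetical, and it assumes the loader stored in response.meta['loader'] exposes dec_counter(), counter and load_item() as in the CatLoader above. It logs and swallows the exception instead of yielding inside finally.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def finishes_parent(callback):
    """Hypothetical decorator: count down the shared loader after the
    wrapped callback has run, whether or not it raised."""
    @functools.wraps(callback)
    def wrapper(self, response):
        loader = response.meta['loader']
        try:
            # Re-yield whatever the wrapped callback produces (it may return None).
            yield from callback(self, response) or ()
        except Exception:
            # Log and swallow, so one failing child page cannot lose the parent item.
            logger.exception('Callback failed for %s', response.url)
        loader.dec_counter()
        if loader.counter == 0:
            # Last pending child has finished: emit the parent item.
            yield loader.load_item()
    return wrapper
```

A callback would then be written as a plain generator with @finishes_parent above it. Note that this only covers exceptions raised inside the callback itself; requests that fail before a response arrives never reach the callback and would need separate handling.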

UPD: Hierarchic items: category -> article -> images. All are to be saved in different tables with proper relations. So: 1) Articles must be inserted into their table AFTER the category. 2) An article must know the ID of its category to form the relation. The same applies to image records.
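
For the relational requirement, one possible direction is to drop the counter and yield each category and article as its own item, letting the pipeline resolve the parent row by a natural key. The sketch below is an assumption-heavy illustration, not a tested Scrapy setup: items are plain dicts with a hypothetical 'type' field, the page URL serves as the natural key, and the sqlite3 table and column names are made up. Only the process_item(item, spider) method name follows Scrapy's pipeline interface. Images would follow the same pattern with an article_id column.

```python
import sqlite3


class RelationalPipeline:
    """Hypothetical pipeline: children find (or create) their parent row by URL."""

    def __init__(self, path=':memory:'):
        self.db = sqlite3.connect(path)
        self.db.executescript("""
            CREATE TABLE IF NOT EXISTS category (
                id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL, name TEXT);
            CREATE TABLE IF NOT EXISTS article (
                id INTEGER PRIMARY KEY, url TEXT UNIQUE NOT NULL, title TEXT,
                category_id INTEGER NOT NULL REFERENCES category(id));
        """)

    def category_id(self, url):
        # Insert a placeholder row if this category has not been stored yet.
        self.db.execute('INSERT OR IGNORE INTO category (url) VALUES (?)', (url,))
        row = self.db.execute('SELECT id FROM category WHERE url = ?', (url,))
        return row.fetchone()[0]

    def process_item(self, item, spider):
        if item['type'] == 'category':
            # Fill in the details, whether or not a placeholder already exists.
            cat_id = self.category_id(item['url'])
            self.db.execute('UPDATE category SET name = ? WHERE id = ?',
                            (item['name'], cat_id))
        elif item['type'] == 'article':
            # The category row exists before the article row is written.
            cat_id = self.category_id(item['category_url'])
            self.db.execute(
                'INSERT OR IGNORE INTO article (url, title, category_id) '
                'VALUES (?, ?, ?)', (item['url'], item['title'], cat_id))
        self.db.commit()
        return item
```

Because the category row is created on demand, items may arrive in any order, and child items are never accumulated in memory, which also sidesteps the concern about a huge number of children.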

Source: https://stackoverflow.com/questions/46383499/scrapy-how-to-populate-hierarchic-items-with-multipel-requests

Tags

python

scrapy

synchronization

hierarchical