Twisted: Waiting for subtasks to finish

Asked by 挽巷 · 2021-01-02 21:21 · 1 answer · 1738 views

In my code, I have two hypothetical tasks: one gets urls from a generator and batch downloads them using Twisted's Cooperator, and the other takes a downloaded source and a

1 Answer
  • 2021-01-02 22:03

    As currently written, it looks to me like this code will have a limited number of parallel downloads, but an unlimited number of parallel parse jobs. Is that intentional? I'm going to assume "no", since if your network happens to be fast and your parser happens to be slow, as the number of URLs approaches infinity, so does your memory usage :).

    So here's a thing that will have limited parallelism but carry out parses sequentially with downloads, instead:

    from twisted.internet import defer, task
    from twisted.web.client import getPage
    
    BATCH_SIZE = 5
    
    def main_task(reactor):
        def fetch_urls():
            for url in get_urls():
                yield getPage(url).addCallback(parse)
    
        coop = task.Cooperator()
        urls = fetch_urls()
    
        return (defer.DeferredList([coop.coiterate(urls)
                                   for _ in xrange(BATCH_SIZE)])
                .addCallback(task_finished))
    
    task.react(main_task)
    

    This works because parse (apparently) returns a Deferred: adding it as a callback to the one returned by getPage produces a Deferred that won't fire the callback added by coiterate until parse has done its business.
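    That chaining behavior can be sketched without Twisted at all. Below is a toy stand-in (not Twisted's real implementation, and the names ToyDeferred, page, parse_result are made up for the demo) showing that when a callback returns another deferred, the rest of the chain pauses until the inner one fires:

    ```python
    class ToyDeferred:
        """A drastically simplified stand-in for twisted.internet.defer.Deferred."""
        def __init__(self):
            self.callbacks = []
            self.called = False
            self.result = None

        def addCallback(self, fn):
            if self.called:
                self.result = fn(self.result)
            else:
                self.callbacks.append(fn)
            return self

        def callback(self, result):
            self.result = result
            self.called = True
            while self.callbacks:
                out = self.callbacks.pop(0)(self.result)
                if isinstance(out, ToyDeferred):
                    # Pause here: the remaining callbacks run only after `out` fires.
                    rest, self.callbacks, self.called = self.callbacks, [], False
                    def resume(r, rest=rest):
                        self.callbacks = rest
                        self.callback(r)
                        return r
                    out.addCallback(resume)
                    return
                self.result = out

    events = []
    parse_result = ToyDeferred()                    # what a slow parse() returns
    page = ToyDeferred()                            # what getPage(url) returns
    page.addCallback(lambda src: parse_result)      # like .addCallback(parse)
    page.addCallback(lambda r: events.append(r))    # like the callback coiterate adds

    page.callback("<html>")                         # the download finishes...
    assert events == []                             # ...but the chain is paused
    parse_result.callback("parsed")                 # the parse finishes...
    assert events == ["parsed"]                     # ...only now does the chain resume
    ```

    This is exactly why coiterate won't pull the next url from the generator until the previous parse completes: from its point of view, the download "isn't done" until the whole chain has fired.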

    Since you were asking about idiomatic Twisted code, I've also taken the liberty of modernizing it a bit (using task.react rather than running the reactor manually, inlining expressions to make things briefer and so on).

    If you really do want to have more parallel parses than parallel fetches, something like this may work better, then:

    from twisted.internet import defer, task
    from twisted.web.client import getPage
    
    PARALLEL_FETCHES = 5
    PARALLEL_PARSES = 10
    
    def main_task(reactor):
        parseSemaphore = defer.DeferredSemaphore(PARALLEL_PARSES)
    
        def parseWhenReady(r):
            def parallelParse(_):
                parse(r).addBoth(
                    lambda result: parseSemaphore.release().addCallback(
                        lambda _: result
                    )
                )
            return parseSemaphore.acquire().addCallback(parallelParse)
    
        def fetch_urls():
            for url in get_urls():
                yield getPage(url).addCallback(parseWhenReady)
    
        coop = task.Cooperator()
        urls = fetch_urls()
    
        return (defer.DeferredList([coop.coiterate(urls)
                                   for _ in xrange(PARALLEL_FETCHES)])
                .addCallback(lambda done:
                             defer.DeferredList(
                                [parseSemaphore.acquire()
                                 for _ in xrange(PARALLEL_PARSES)]
                             ))
                .addCallback(task_finished))
    
    task.react(main_task)
    

    You can see that parseWhenReady returns the Deferred returned from acquire, so parallel fetching will continue as soon as the parallel parse can begin, and therefore you won't keep fetching indiscriminately even when the parser is overloaded. However, parallelParse carefully abstains from returning the Deferred returned by parse or release, since fetching should be able to continue while those are proceeding.
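    The same back-pressure pattern can be sketched with the standard library's asyncio — an analogy, not Twisted code, and all names here (parse, main, the u1..u5 urls) are invented for the demo. A semaphore caps in-flight parses; a "fetch" waits only to acquire a slot, never for the parse itself to finish:

    ```python
    import asyncio

    PARALLEL_PARSES = 2
    active = 0   # parses currently in flight
    peak = 0     # highest parse concurrency observed
    done = []

    async def parse(src, sem):
        """Hypothetical parse job; holds a semaphore slot for its whole duration."""
        global active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)      # stand-in for real parse work
        active -= 1
        done.append(src)
        sem.release()                  # free the slot only once the parse ends

    async def main():
        sem = asyncio.Semaphore(PARALLEL_PARSES)
        for url in ["u1", "u2", "u3", "u4", "u5"]:
            # A "fetch" waits only for a parse *slot*, not for the parse itself:
            await sem.acquire()
            asyncio.create_task(parse(url, sem))
        # Drain: reacquire every slot, proving all parses have finished.
        for _ in range(PARALLEL_PARSES):
            await sem.acquire()

    asyncio.run(main())
    print(peak, sorted(done))          # peak never exceeds PARALLEL_PARSES
    ```

    The final drain loop mirrors the DeferredList of acquire calls in the Twisted version above: once the main task holds every slot, no parse can still be running.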

    (Please note that since your initial example was not runnable, I haven't tested either of these at all. Hopefully the intent is clear even if there are bugs.)
