Python aiohttp/asyncio - how to process returned data

-上瘾入骨i 2021-02-04 05:07

I'm in the process of moving some synchronous code to asyncio using aiohttp. The synchronous code was taking 15 minutes to run, so I'm hoping to improve this.

I have so

2 answers
  • 2021-02-04 05:47

    Here's an example with concurrent.futures.ProcessPoolExecutor. If it is created without specifying max_workers, it defaults to the number of processors on the machine (os.cpu_count()). asyncio.wrap_future, used below to await the pool's concurrent.futures.Future, is a public function. Alternatively, there's AbstractEventLoop.run_in_executor.

    import asyncio
    from concurrent.futures import ProcessPoolExecutor
    
    import aiohttp
    import lxml.html
    
    
    def process_page(html):
        '''Meant for CPU-bound workload'''
        tree = lxml.html.fromstring(html)
        return tree.findtext('.//title')  # None if the page has no <title>
    
    
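    async def fetch_page(url, session):
        '''Meant for IO-bound workload'''
        # ClientTimeout is the aiohttp 3.x way to set a total timeout
        timeout = aiohttp.ClientTimeout(total=15)
        async with session.get(url, timeout=timeout) as res:
            return await res.text()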
    
    
    async def process(url, session, pool):
        html = await fetch_page(url, session)
        return await asyncio.wrap_future(pool.submit(process_page, html))
    
    
    async def dispatch(urls):
        # Shut the worker processes down once every page has been handled
        with ProcessPoolExecutor() as pool:
            async with aiohttp.ClientSession() as session:
                coros = (process(url, session, pool) for url in urls)
                return await asyncio.gather(*coros)
    
    
    def main():
        urls = [
            'https://stackoverflow.com/',
            'https://serverfault.com/',
            'https://askubuntu.com/',
            'https://unix.stackexchange.com/',
        ]
        # asyncio.run (Python 3.7+) creates and closes the event loop for us
        result = asyncio.run(dispatch(urls))
        print(result)
    
    if __name__ == '__main__':
        main()
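
The run_in_executor alternative mentioned above could look roughly like the sketch below. It is not part of the original answer: `extract_title` is a hypothetical regex-based stand-in for `process_page` (so the sketch does not need lxml), and `extract_titles` is an illustrative name. Passing `None` as the executor uses the loop's default thread pool; passing a `ProcessPoolExecutor` instance would move the work to worker processes.

```python
import asyncio
import re
from concurrent.futures import Executor
from typing import List, Optional

# Assumption: a simple regex is enough for the illustration; real HTML
# should be parsed with a proper parser such as lxml.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str) -> Optional[str]:
    """Hypothetical stand-in for process_page: the <title> text, or None."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


async def extract_titles(
    pages: List[str], executor: Optional[Executor] = None
) -> List[Optional[str]]:
    """Run extract_title for every page in the executor, keeping input order."""
    loop = asyncio.get_running_loop()
    # run_in_executor returns an awaitable asyncio future, so no
    # wrap_future call is needed here.
    futures = [loop.run_in_executor(executor, extract_title, html) for html in pages]
    return await asyncio.gather(*futures)
```

With a process pool, the function handed to the executor has to be picklable, which in practice means defining it at module level as done here.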
    
  • 2021-02-04 05:54

    Your code isn't far from the mark. asyncio.gather returns the results in the order of its arguments, so the result order is preserved here, but the page_content calls themselves will not necessarily run in that order.
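
A small sketch of that ordering behaviour, with `asyncio.sleep` standing in for the network request. `fetch_stub`, `demo`, and the delays are made up for the illustration.

```python
import asyncio


async def fetch_stub(name, delay, finished):
    """Hypothetical stand-in for a fetch: sleep, record completion, return a value."""
    await asyncio.sleep(delay)
    finished.append(name)  # records the order in which the coroutines complete
    return name.upper()


async def demo():
    finished = []
    # "a" is passed first but takes the longest to complete
    results = await asyncio.gather(
        fetch_stub("a", 0.06, finished),
        fetch_stub("b", 0.02, finished),
        fetch_stub("c", 0.04, finished),
    )
    return results, finished
```

Here `results` follows the argument order, while `finished` follows the completion order.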

    A few tweaks:

    First of all, you do not need ensure_future here. Explicitly creating a Task is only needed if the coroutine has to outlive its parent, i.e. if it must keep running after the function that created it is done. asyncio.gather schedules bare coroutines itself, so you can pass them to it directly:

    async def get_url_data(urls, username, password):
        async with aiohttp.ClientSession(...) as session:
            responses = await asyncio.gather(*(fetch(session, i) for i in urls))
        for i in responses:
            print(i.title.text)
        return responses
    

    But calling this would schedule all the fetches at the same time, and with a large number of URLs that is far from optimal. Instead, choose a maximum concurrency and ensure that at most X fetches are running at any time. To implement this, you can use an asyncio.Semaphore(20): the semaphore can be held by at most 20 coroutines at once, so the others wait to acquire it until a slot is free.

    CONCURRENCY = 20
    TIMEOUT = 15
    
    async def fetch(session, sem, url):
        async with sem:
            async with session.get(url) as response:
                return page_content(await response.text())
    
    async def get_url_data(urls, username, password):
        sem = asyncio.Semaphore(CONCURRENCY)
        async with aiohttp.ClientSession(...) as session:
            responses = await asyncio.gather(*(
                asyncio.wait_for(fetch(session, sem, i), TIMEOUT)
                for i in urls
            ))
        for i in responses:
            print(i.title.text)
        return responses
    

    This way, all the fetches are started immediately, but only 20 of them will be able to acquire the semaphore. The others will wait at the first async with statement until another fetch is done.

    I have also replaced aiohttp.Timeout with the standard asyncio equivalent, asyncio.wait_for.
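
One thing to be aware of: in the snippet above, asyncio.wait_for wraps the whole fetch coroutine, so the timeout clock also runs while a fetch is still waiting for the semaphore. A possible variant, sketched below, applies the timeout inside the semaphore so that it covers only the request. This is a sketch under assumptions, not the answer's code: `fake_request` is a hypothetical stand-in for the aiohttp call, and the `stats` dictionary exists only to make the concurrency cap visible.

```python
import asyncio


async def fake_request(url):
    """Hypothetical stand-in for session.get(url) plus response.text()."""
    await asyncio.sleep(0.01)
    return f"body of {url}"


async def limited_fetch(sem, url, timeout, stats):
    async with sem:  # wait here until one of the slots is free
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        try:
            # The timeout clock starts only after the slot was acquired
            return await asyncio.wait_for(fake_request(url), timeout)
        finally:
            stats["active"] -= 1


async def fetch_all(urls, concurrency=3, timeout=1.0):
    sem = asyncio.Semaphore(concurrency)
    stats = {"active": 0, "peak": 0}
    bodies = await asyncio.gather(
        *(limited_fetch(sem, url, timeout, stats) for url in urls)
    )
    # peak is the highest number of requests that were in flight at once
    return bodies, stats["peak"]
```

Whether the timeout should include the queueing time is a design choice. Keeping it inside the semaphore means a long queue does not by itself cause timeouts.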

    Finally, for the actual processing of the data: if you are limited by CPU time, asyncio will probably not help you much. You will need a ProcessPoolExecutor to move that work onto other CPU cores. run_in_executor will probably be of use too.
