Python: Pyppeteer with asyncio

时光总嘲笑我的痴心妄想 submitted on 2019-12-22 19:25:44

Question


I was running some tests and I wonder whether the script below actually runs asynchronously.

Fetching a single site:

# python test.py  It took 1.3439464569091797 seconds.

Run one after another, 31 sites x 1.34 s would come to 41.54 s. Fetching all 31 sites with the script is faster than that, but in theory shouldn't it take only about as long as the slowest request?

# python test.py  It took 28.129364728927612 seconds.

Perhaps launching a browser is not async here, and I should use an executor for it?

# cat test.py 
import asyncio
import time

from pyppeteer import launch
from urllib.parse import urlparse

WEBSITE_LIST = [
    'http://envato.com',
    'http://amazon.co.uk',
    'http://amazon.com',
    'http://facebook.com',
    'http://google.com',
    'http://google.fr',
    'http://google.es',
    'http://google.co.uk',
    'http://internet.org',
    'http://gmail.com',
    'http://stackoverflow.com',
    'http://github.com',
    'http://heroku.com',
    'http://djangoproject.com',
    'http://rubyonrails.org',
    'http://basecamp.com',
    'http://trello.com',
    'http://yiiframework.com',
    'http://shopify.com',
    'http://airbnb.com',
    'http://instagram.com',
    'http://snapchat.com',
    'http://youtube.com',
    'http://baidu.com',
    'http://yahoo.com',
    'http://live.com',
    'http://linkedin.com',
    'http://yandex.ru',
    'http://netflix.com',
    'http://wordpress.com',
    'http://bing.com',
]

start = time.time()

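async def fetch(url):
    # Each call launches its own headless Chromium process.
    browser = await launch(headless=True, args=['--no-sandbox'])
    page = await browser.newPage()
    # Wait for the page's load event before taking the screenshot.
    await page.goto(url, {'waitUntil': 'load'})
    # Name the file after the host, e.g. img/google.com.png (img/ must exist).
    await page.screenshot({'path': f'img/{urlparse(url).netloc}.png'})
    await browser.close()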

async def run():
    tasks = []

    for url in WEBSITE_LIST:
        task = asyncio.ensure_future(fetch(url))
        tasks.append(task)

    responses = await asyncio.gather(*tasks)
    #print(responses)

#asyncio.get_event_loop().run_until_complete(fetch('http://yahoo.com'))
loop = asyncio.get_event_loop()
future = asyncio.ensure_future(run())
loop.run_until_complete(future)

print(f'It took {time.time()-start} seconds.')

Answer 1:


According to the pyppeteer source code, it manages the Chromium processes with subprocess (without pipes) and communicates with them over websockets, so it is async.

You have 31 sites, so you'll have 31 + 1 processes. Unless you have a CPU with 32 cores, they won't all run fully in parallel. (Threads, system processes, locks, hyper-threading and other factors also affect the result, so this is only a rough example.) I therefore think the bottleneck is the CPU: opening browsers, rendering web pages and dumping them into images. Using an executor won't help.
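
If the CPU is the limit, one option is to cap how many browsers run at once rather than starting all 31 together. The following is a minimal sketch and not part of the original answer: `fake_fetch` is a hypothetical stand-in that sleeps instead of launching Chromium, and capping at `os.cpu_count()` is an assumption, not a tuned value.

```python
import asyncio
import os

# Assumed cap: one simulated browser per CPU core (illustrative, not tuned).
MAX_CONCURRENT = os.cpu_count() or 4


async def fake_fetch(url, semaphore, stats):
    # Hypothetical stand-in for the question's fetch(): it sleeps instead of
    # launching Chromium, so the sketch runs without pyppeteer installed.
    async with semaphore:
        stats['running'] += 1
        stats['peak'] = max(stats['peak'], stats['running'])
        await asyncio.sleep(0.01)
        stats['running'] -= 1
    return url


async def run_bounded(urls, limit=MAX_CONCURRENT):
    # The semaphore lets at most `limit` fetches be in flight at any moment.
    semaphore = asyncio.Semaphore(limit)
    stats = {'running': 0, 'peak': 0}
    results = await asyncio.gather(
        *(fake_fetch(url, semaphore, stats) for url in urls)
    )
    return results, stats['peak']


def main():
    urls = [f'http://example{i}.test' for i in range(31)]
    results, peak = asyncio.run(run_bounded(urls))
    print(f'{len(results)} fetches, at most {peak} running at once')


if __name__ == '__main__':
    main()
```

In the real script the body of the `async with` block would be the browser launch, navigation and screenshot. Whether this shortens the total time depends on the machine, so it is worth measuring.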

However, it is still async: your Python process is not blocked, so you can still run other code or wait for network results concurrently. It is just that when the CPU is fully loaded by other processes, it becomes harder for the Python process to "steal" CPU time.
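
The non-blocking behaviour can be seen in a small sketch that replaces the browser with `asyncio.sleep`. The names `simulated_request`, `heartbeat` and `demo` are hypothetical and exist only for this illustration. It models the waiting only, not the CPU load of real Chromium processes, so it shows the ideal case the question expected rather than the 28 seconds actually measured.

```python
import asyncio
import time


async def simulated_request(delay):
    # Hypothetical stand-in for awaiting a reply from a browser process.
    await asyncio.sleep(delay)
    return delay


async def heartbeat(counter, interval=0.01):
    # "Other code" that keeps running while the requests are pending.
    while True:
        counter['ticks'] += 1
        await asyncio.sleep(interval)


async def demo(delays):
    counter = {'ticks': 0}
    ticker = asyncio.ensure_future(heartbeat(counter))
    start = time.monotonic()
    # All waits overlap, so the total is close to the longest single delay.
    results = await asyncio.gather(*(simulated_request(d) for d in delays))
    elapsed = time.monotonic() - start
    # Stop the background task and wait for the cancellation to finish.
    ticker.cancel()
    try:
        await ticker
    except asyncio.CancelledError:
        pass
    return results, elapsed, counter['ticks']


def main():
    delays = [0.2] * 5
    results, elapsed, ticks = asyncio.run(demo(delays))
    print(f'sum of delays: {sum(delays):.2f}s, elapsed: {elapsed:.2f}s, '
          f'heartbeat ticks: {ticks}')


if __name__ == '__main__':
    main()
```

When the awaited work is pure waiting, the elapsed time stays near the longest delay and the heartbeat keeps ticking. With 31 real browsers competing for CPU, the waits themselves get longer, which matches the explanation above.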



Source: https://stackoverflow.com/questions/51041482/python-pyppeteer-with-asyncio
