Managing puppeteer for memory and performance

Front-end · unresolved · 3 answers · 422 views
天命终不由人 · 2020-12-08 22:35

I'm using puppeteer for scraping some pages, but I'm curious about how to manage this in production for a Node app. I'll be scraping up to 500,000 pages in a day, but the …

3 Answers
  • 2020-12-08 23:14

    If you are scraping 500,000 pages per day (roughly one page every 0.17 seconds), I would recommend opening a new page in an existing browser session rather than launching a new browser for each page.

    You can open and close a Page using the following methods:

    const page = await browser.newPage(); // open a new tab in the existing browser
    await page.close();                   // close the tab when finished
    

    If you decide to use one Browser for your project, make sure to implement error-handling procedures so that if the program crashes, you have minimal downtime while you create a new Page, Browser, or BrowserContext.
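
    As a rough sketch of that single-browser pattern (hypothetical helper names; the relaunch-on-disconnect check is one way to handle crashes, not the only one):

    const puppeteer = require('puppeteer');

    let browser;

    // Launch the shared browser on first use, and relaunch it if it has crashed.
    async function getBrowser() {
        if (!browser || !browser.isConnected()) {
            browser = await puppeteer.launch();
        }
        return browser;
    }

    // Scrape one URL in a fresh page, reusing the shared browser session.
    async function scrape(url) {
        const page = await (await getBrowser()).newPage();
        try {
            await page.goto(url);
            // ... extract your data here ...
        } finally {
            await page.close(); // always release the page, even on error
        }
    }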

  • 2020-12-08 23:14

    You probably want to create a pool of multiple Chromium instances with independent browsers. The advantage is that when one browser crashes, all other jobs can keep running. The advantage of a single browser (with multiple pages) is slightly lower memory and CPU usage, and cookies are shared between your pages.

    Pool of puppeteer instances

    The library puppeteer-cluster (disclaimer: I'm the author) creates a pool of browsers or pages for you. It takes care of creation, error handling, browser restarting, etc., so you can simply queue jobs/URLs and the library takes care of everything else.

    Code sample

    const { Cluster } = require('puppeteer-cluster');
    
    (async () => {
        const cluster = await Cluster.launch({
            concurrency: Cluster.CONCURRENCY_BROWSER, // use one browser per worker
            maxConcurrency: 4, // cluster with four workers
        });
    
        // Define a task to be executed for your data (put your "crawling code" in here)
        await cluster.task(async ({ page, data: url }) => {
            await page.goto(url);
            // ...
        });
    
        // Queue URLs when the cluster is created
        cluster.queue('http://www.google.com/');
        cluster.queue('http://www.wikipedia.org/');
    
        // Or queue URLs anytime later
        setTimeout(() => {
            cluster.queue('http://...');
        }, 1000);
    })();
    

    You can also queue functions directly in case you have different tasks to do. Normally you would close the cluster after you are finished via cluster.close(), but you are free to just leave it open. You'll find another example in the repository of a cluster that fetches data when a request comes in.
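
    For instance, queueing a one-off function and shutting down once the queue is drained (a minimal sketch using the library's queue, idle and close methods, inside the same async function as above):

    // Queue a function directly instead of a URL
    cluster.queue(async ({ page }) => {
        await page.goto('http://www.example.com/');
        // ... task-specific code ...
    });

    // Wait until all queued jobs have finished, then shut down
    await cluster.idle();
    await cluster.close();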

  • 2020-12-08 23:21
    • Reuse the browser and page instances instead of launching a browser each time.
    • Also, have your Chrome scraper take requests from a queue rather than a REST endpoint. That way Chrome can take its time, and if something crashes, the requests remain safely in the queue.

    Other performance-related articles:

    1. How to skip rendering images, fonts, and stylesheets (see the sketch after this list): https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
    2. Improving Performance - https://docs.browserless.io/blog/2019/05/03/improving-puppeteer-performance.html
    3. If you have enough time, CEF is worth another look: https://bitbucket.org/chromiumembedded/cef/src/master/ - I am currently looking at this through Java. The good thing is there is no need to run Chromium processes separately; it's more integrated, but it comes with its own pros and cons.
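
    For article 1, the usual approach is puppeteer's request interception; a minimal sketch (run this before page.goto):

    // Skip heavy static assets to speed up page loads
    await page.setRequestInterception(true);
    page.on('request', (request) => {
        const type = request.resourceType();
        if (type === 'image' || type === 'font' || type === 'stylesheet') {
            request.abort();    // don't download images, fonts, or stylesheets
        } else {
            request.continue();
        }
    });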

    Here is another example, using the puppeteer and generic-pool libraries.

    const puppeteer = require('puppeteer');
    const genericPool = require("generic-pool");
    const debug = require('debug')('chromepool'); // the error handlers below log via debug()

    async function createChromePool() {

        const factory = {
            create: function() {
                // Open an instance of the Chrome headless browser - the Heroku buildpack requires these args
                return puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--ignore-certificate-errors'] });
            },
            destroy: function(client) {
                // Close the browser; return the promise so the pool can await it
                return client.close();
            }
        };
        const opts = { max: 1, acquireTimeoutMillis: 120000, priorityRange: 3 };
        global.chromepool = genericPool.createPool(factory, opts);

        global.chromepool.on('factoryCreateError', function(err){
            debug(err);
        });
        global.chromepool.on('factoryDestroyError', function(err){
            debug(err);
        });

    }
    
    async function destroyChromePool() {
    
        // Only call this once in your application -- at the point you want to shutdown and stop using this pool.
        return global.chromepool.drain().then(function() {
            global.chromepool.clear();
        });
    
    }
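
    A borrowed browser must always be returned to the pool; here is a hypothetical scrapeWithPool helper showing the acquire/release cycle:

    async function scrapeWithPool(url) {
        const browser = await global.chromepool.acquire(); // borrow a browser from the pool
        try {
            const page = await browser.newPage();
            await page.goto(url);
            // ... extract data ...
            await page.close();
        } finally {
            await global.chromepool.release(browser);      // return it, even on error
        }
    }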
    