Managing puppeteer for memory and performance


天命终不由人 2020-12-08 22:35

I'm using Puppeteer to scrape some pages, but I'm curious about how to manage this in production for a Node app. I'll be scraping up to 500,000 pages in a day, but the

3 Answers

囚心锁ツ (original poster)

2020-12-08 23:21

Reuse the browser and page instances instead of launching a new browser for every request.
Also have your Chrome scraper take requests from a queue rather than a REST endpoint. That way Chrome can take as long as it needs, and if something crashes, the pending requests are still in the queue.
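The two suggestions above can be sketched together as a small worker loop. This is only an illustration under stated assumptions: `runWorker`, `launchBrowser`, `handlePage` and the plain array used as a queue are hypothetical names, and the array stands in for whatever message queue you actually use. Only `newPage`, `goto` and `close` are Puppeteer calls.

```javascript
// Sketch: one long-lived browser and page serving URLs pulled from a queue.
// `queue` is a plain array here, standing in for a real message queue (assumption).
async function runWorker(launchBrowser, queue, handlePage) {
  const browser = await launchBrowser(); // launched once, reused for every job
  const page = await browser.newPage();  // one page reused across jobs
  const results = [];
  try {
    while (queue.length > 0) {
      const url = queue.shift();
      try {
        await page.goto(url, { waitUntil: 'domcontentloaded' });
        results.push({ url, ok: true, data: await handlePage(page) });
      } catch (err) {
        // A failed job is recorded rather than crashing the worker;
        // a real queue would typically re-deliver it.
        results.push({ url, ok: false, error: String(err.message || err) });
      }
    }
  } finally {
    await browser.close();
  }
  return results;
}

// With real Puppeteer (assumes the `puppeteer` package is installed):
// const puppeteer = require('puppeteer');
// runWorker(() => puppeteer.launch(), ['https://example.com'], (page) => page.title());
```

Passing the launcher in as a function keeps the loop independent of how the browser is started, which also makes it easy to exercise with a stub.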

Other performance-related articles:

How to skip rendering images, fonts and stylesheets - https://www.scrapehero.com/how-to-increase-web-scraping-speed-using-puppeteer/
Improving performance - https://docs.browserless.io/blog/2019/05/03/improving-puppeteer-performance.html
If you have enough time, CEF is worth another look - https://bitbucket.org/chromiumembedded/cef/src/master/ - I am currently looking at it through Java. The good thing is that there is no need to run Chromium processes separately; it's more integrated, but it comes with its own pros and cons.
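The first article's idea of skipping images, fonts and stylesheets is usually done with Puppeteer's request interception. A minimal sketch follows. The helper name `blockHeavyResources` and the exact set of blocked types are assumptions for illustration, not taken from the linked article.

```javascript
// Resource types to drop; adjust to what the target pages actually need (assumption).
const BLOCKED_TYPES = new Set(['image', 'font', 'stylesheet']);

// Turns on request interception for `page` and aborts heavy resource requests.
async function blockHeavyResources(page) {
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (BLOCKED_TYPES.has(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });
}

// Usage with a Puppeteer page, before navigating:
// await blockHeavyResources(page);
// await page.goto('https://example.com');
```

Once interception is enabled, every request has to be either aborted or continued, otherwise navigation will stall.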

Here is another example, using the puppeteer and generic-pool libraries.

const puppeteer = require('puppeteer');
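const genericPool = require("generic-pool");
// Assumption: `debug` used below is the `debug` npm package; the 'chromepool' namespace is illustrative
const debug = require('debug')('chromepool');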

async function createChromePool() {

    const factory = {
        create: function() {
            //open an instance of the Chrome headless browser - Heroku buildpack requires these args
            return puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox', '--ignore-certificate-errors'] });
        },
        destroy: function(client) {
            // close the browser; return the promise so the pool can wait for the shutdown to finish
            return client.close();
        }
    };
    const opts = { max: 1, acquireTimeoutMillis: 120000, priorityRange: 3};
    global.chromepool = genericPool.createPool(factory, opts);

    global.chromepool.on('factoryCreateError', function(err){
        debug(err);
    });
    global.chromepool.on('factoryDestroyError', function(err){
        debug(err);
    });

}

async function destroyChromePool() {

    // Only call this once in your application, at the point you want to shut down and stop using this pool.
    // Wait for borrowed browsers to be returned, then close every pooled instance.
    await global.chromepool.drain();
    await global.chromepool.clear();

}
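The snippet above creates and tears down the pool but does not show a request borrowing a browser from it. A possible sketch follows. `scrapeTitle` is a hypothetical helper, and it assumes the pool was built with generic-pool as above, so that `acquire()` and `release()` return promises.

```javascript
// Borrows a browser from the pool, reads the page title, and always gives the browser back.
async function scrapeTitle(pool, url) {
  const browser = await pool.acquire(); // waits until a pooled browser is free
  let page;
  try {
    page = await browser.newPage();
    await page.goto(url);
    return await page.title();
  } finally {
    // Close the tab so memory does not pile up in the reused browser,
    // then return the browser to the pool even if scraping failed.
    if (page) await page.close();
    await pool.release(browser);
  }
}

// Usage, after createChromePool() has run:
// const title = await scrapeTitle(global.chromepool, 'https://example.com');
```

Releasing in `finally` matters here because the pool is configured with `max: 1`: a browser that is never released would block every later `acquire()` until `acquireTimeoutMillis` expires.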
