Crawling multiple URLs in a loop using Puppeteer

对着背影说爱祢 submitted on 2021-02-05 21:34:44

Question


I have

urls = ['url','url','url'...]

this is what I'm doing

urls.map(async (url)=>{
  await page.goto(url);
  await page.waitForNavigation({ waitUntil: 'networkidle' });
})

This doesn't seem to wait for each page to load; it visits all the URLs quite rapidly (I even tried using page.waitFor).

I just want to know whether I'm doing something fundamentally wrong, or whether this kind of usage isn't advised/supported.


Answer 1:


map, forEach, reduce, etc., do not wait for the asynchronous operation inside them before moving on to the next element of the array they are iterating over.

There are multiple ways of going through each item of an iterable sequentially while performing an asynchronous operation, but the easiest in this case is to simply use a normal for loop, which does wait for each operation to finish before moving on.

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    await page.goto(`${url}`);
    await page.waitForNavigation({ waitUntil: 'networkidle2' });
}

This visits one URL after another, as you expect. If you are curious about iterating serially using async/await, have a look at this answer: https://stackoverflow.com/a/24586168/791691
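
The same serial behaviour can also be written with a for...of loop. This is only a minimal sketch, assuming page already exists and that passing waitUntil directly to goto is enough for the pages being crawled:

for (const url of urls) {
    // Each await completes before the loop moves on to the next URL.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // ...scrape the page here...
}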




Answer 2:


If you find that you are waiting on your promise indefinitely, the proposed solution is to start waiting for the navigation before triggering it with goto:

const urls = [...]

for (let i = 0; i < urls.length; i++) {
    const url = urls[i];
    // Start listening for the navigation before goto triggers it,
    // so the navigation event cannot be missed.
    const promise = page.waitForNavigation({ waitUntil: 'networkidle0' });
    await page.goto(`${url}`);
    await promise;
}

As referenced in this GitHub issue.




Answer 3:


The accepted answer shows how to visit each page serially, one at a time. However, you may want to visit multiple pages simultaneously when the task is embarrassingly parallel, that is, when scraping one page doesn't depend on data extracted from other pages.

A tool that can help achieve this is Promise.allSettled, which lets us fire off a bunch of promises at once, determine which were successful, and harvest the results.

For a basic example, let's say we want to scrape usernames for Stack Overflow users given a series of ids.

Serial code:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({dumpio: false});
  const page = await browser.newPage();
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 40;
  const usernames = [];

  for (let i = startId; i < startId + qty; i++) {
    await page.goto(`${baseURL}/${i}`);

    try {
      usernames.push(await page.$eval(
        ".profile-user--name",
        el => el.children[0].innerText
      ));
    }
    catch (err) {} // skip profiles that don't exist or don't match the selector
  }

  console.log(usernames.length);
  await browser.close();
})();

Parallel code:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({dumpio: false});
  const baseURL = "https://stackoverflow.com/users";
  const startId = 6243352;
  const qty = 40;

  const usernames = (await Promise.allSettled(
    [...Array(qty)].map(async (_, i) => {
      const page = await browser.newPage();
      await page.goto(`${baseURL}/${i + startId}`);
      return page.$eval(
        ".profile-user--name", 
        el => el.children[0].innerText
      );
    })))
    .filter(e => e.status === "fulfilled")
    .map(e => e.value)
  ;
  console.log(usernames.length);
  await browser.close();
})();

Remember that this is a technique, not a silver bullet that guarantees a speed-up on every workload. It will take some experimentation to find the optimal balance between the cost of creating more page objects and the parallelization of network requests for a given task and system.

The example here is contrived since it's not interacting with the page dynamically, so there's not as much room for gain as in a typical Puppeteer use case that involves network requests and blocking waits per page.

Of course, beware of rate limiting and any other restrictions imposed by sites.

For tasks where creating a page per task is prohibitively expensive or you'd like to set a cap on parallel request dispatches, consider using a task queue.
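
As a rough sketch of that idea, here is a minimal hand-rolled worker pool (the crawlWithPool helper and the placeholder URLs are made up for the example): a fixed number of workers share one queue of URLs, and each worker reuses a single page, so concurrency is capped without opening a page per task.

const puppeteer = require("puppeteer");

// Minimal worker pool: `concurrency` workers pull URLs from a shared queue,
// each reusing one page for all of its tasks.
async function crawlWithPool(browser, urls, concurrency, handler) {
  let next = 0;
  const results = [];
  const workers = [...Array(concurrency)].map(async () => {
    const page = await browser.newPage();
    while (next < urls.length) {
      const i = next++;
      try {
        await page.goto(urls[i]);
        results[i] = await handler(page);
      } catch (err) {
        results[i] = null; // record the failure and keep the other workers going
      }
    }
    await page.close();
  });
  await Promise.all(workers);
  return results;
}

(async () => {
  const browser = await puppeteer.launch();
  const urls = ["https://example.com/", "https://example.org/"]; // placeholder URLs
  const titles = await crawlWithPool(browser, urls, 2, page => page.title());
  console.log(titles);
  await browser.close();
})();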

This pattern can also be extended to handle the case when certain pages depend on data from other pages, forming a dependency graph.
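
For instance, a two-stage dependency could be sketched roughly as follows (the listing URL and the a.detail selector are hypothetical): the listing page is scraped first, and the detail pages it links to are fetched in parallel, since they only depend on stage one.

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();

  // Stage 1: scrape a (hypothetical) listing page to discover detail-page URLs.
  const listPage = await browser.newPage();
  await listPage.goto("https://example.com/list");
  const detailURLs = await listPage.$$eval("a.detail", els => els.map(el => el.href));
  await listPage.close();

  // Stage 2: the detail pages depend only on stage 1, so fetch them in parallel.
  const titles = (await Promise.allSettled(
    detailURLs.map(async url => {
      const page = await browser.newPage();
      await page.goto(url);
      const title = await page.title();
      await page.close();
      return title;
    })))
    .filter(r => r.status === "fulfilled")
    .map(r => r.value);

  console.log(titles);
  await browser.close();
})();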




Answer 4:


This is the best way I found to achieve this:

const puppeteer = require('puppeteer');

(async () => {
    const urls = ['https://www.google.com/', 'https://www.google.com/'];
    for (let i = 0; i < urls.length; i++) {
        const url = urls[i];
        // Launches (and closes) a fresh browser for every URL.
        const browser = await puppeteer.launch({ headless: false });
        const page = await browser.newPage();
        await page.goto(`${url}`, { waitUntil: 'networkidle2' });
        await browser.close();
    }
})();


Source: https://stackoverflow.com/questions/46293216/crawling-multiple-url-in-a-loop-using-puppeteer
