Question
I am currently using the following Puppeteer AWS Lambda layer to scrape 30 URLs, creating and saving screenshots in S3. At the moment I send 30 individual payloads, thereby invoking 30 separate AWS Lambda functions: https://github.com/shelfio/chrome-aws-lambda-layer
Each JSON payload contains a URL and an image file name, and the payloads are sent every 2-3 seconds to API Gateway via POST requests. The first 6 or 9 Lambda functions in the list seem to run fine, then they start to fail with Navigation failed because browser has disconnected!, as reported in AWS CloudWatch.
So I am looking for an alternative solution: how could I edit the code below to batch-screenshot a set of 30 URLs by handling a single array of JSON payloads (e.g. with a for loop)?
Here is my current code for generating individual AWS Lambda screenshots and sending to S3:
// src/capture.js
// this module will be provided by the layer
const chromeLambda = require("chrome-aws-lambda");
// aws-sdk is always preinstalled in AWS Lambda in all Node.js runtimes
const S3Client = require("aws-sdk/clients/s3");
process.setMaxListeners(0) // <== Important line - fixes MaxListeners error
// create an S3 client
const s3 = new S3Client({ region: process.env.S3_REGION });
// default browser viewport size
const defaultViewport = {
width: 1920,
height: 1080
};
// here starts our function!
exports.handler = async event => {
// launch a headless browser
const browser = await chromeLambda.puppeteer.launch({
args: chromeLambda.args,
executablePath: await chromeLambda.executablePath,
defaultViewport
});
console.log("Event URL string is ", event.url)
const url = event.url;
const domain = (new URL(url)).hostname.replace('www.', '');
// open a new tab
const page = await browser.newPage();
// navigate to the page
await page.goto(event.url);
// take a screenshot
const buffer = await page.screenshot()
// upload the image using the current timestamp as filename
const result = await s3
.upload({
Bucket: process.env.S3_BUCKET,
Key: domain + `.png`,
Body: buffer,
ContentType: "image/png",
ACL: "public-read"
})
.promise();
// return the uploaded image url
return { url: result.Location };
};
Current Individual JSON Payload
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gavurin.com.png","url":"https://gavurin.com"}
Answer 1:
I tried to replicate the issue and modified the code to use a loop.
While working on this issue, I found several things worth pointing out:
- The Lambda requires a lot of RAM (at least 1 GB in my tests, but more is better). Using a small amount of RAM led to failures.
- The Lambda timeout must be large enough to handle the number of URLs being screenshotted.
- The img value from the JSON payload is not used at all. I did not modify this behavior, as I don't know whether this is by design or not.
- Errors similar to yours were observed when running an async for loop and/or when not closing the pages that were opened.
- I modified the return value to output an array of S3 URLs.
- URL was undefined, so the modified code requires it from the url module.
Modified code
Here is the modified code that worked in my tests using the nodejs12.x runtime:
// src/capture.js
var URL = require('url').URL;
// this module will be provided by the layer
const chromeLambda = require("chrome-aws-lambda");
// aws-sdk is always preinstalled in AWS Lambda in all Node.js runtimes
const S3Client = require("aws-sdk/clients/s3");
process.setMaxListeners(0) // <== Important line - fixes MaxListeners error
// create an S3 client
const s3 = new S3Client({ region: process.env.S3_REGION });
// default browser viewport size
const defaultViewport = {
width: 1920,
height: 1080
};
// here starts our function!
exports.handler = async event => {
// launch a headless browser
const browser = await chromeLambda.puppeteer.launch({
args: chromeLambda.args,
executablePath: await chromeLambda.executablePath,
defaultViewport
});
const s3_urls = [];
for (const e of event) {
console.log(e);
console.log("Event URL string is ", e.url)
const url = e.url;
const domain = (new URL(url)).hostname.replace('www.', '');
// open a new tab
const page = await browser.newPage();
// navigate to the page
await page.goto(e.url);
// take a screenshot
const buffer = await page.screenshot()
// upload the image using the current timestamp as filename
const result = await s3
.upload({
Bucket: process.env.S3_BUCKET,
Key: domain + `.png`,
Body: buffer,
ContentType: "image/png",
ACL: "public-read"
})
.promise();
await page.close();
s3_urls.push({ url: result.Location });
}
await browser.close();
// return the uploaded image url
return s3_urls;
};
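The loop above takes the screenshots strictly one at a time, which is safe but slow for 30 URLs. A middle ground between that and an unbounded async loop (which, as noted in the list above, caused browser-disconnect errors) is to cap how many pages run concurrently. The mapWithConcurrency helper below is my own sketch, not part of the answer; it is plain JavaScript with no extra dependencies:

```javascript
// mapWithConcurrency is a hypothetical helper, not part of the answer's code:
// it runs an async worker over items with at most `limit` workers active at
// once, and resolves to the results in the original item order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // shared index the runners pull from
  async function run() {
    while (next < items.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await worker(items[i], i);
    }
  }
  // start at most `limit` runners and wait for all of them to drain the list
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

The body of the for loop (newPage, goto, screenshot, upload, close) would become the worker passed as mapWithConcurrency(event, 3, worker); the limit of 2-3 concurrent pages is an assumption to tune against the Lambda's memory setting, not a tested value.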
Example payload
[
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gavurin.com.png","url":"https://gavurin.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/google.com.png","url":"https://google.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/amazon.com","url":"https://www.amazon.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/stackoverflow.com","url":"https://stackoverflow.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/duckduckgo.com","url":"https://duckduckgo.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/docs.aws.amazon.com","url":"https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-features.html"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com","url":"https://github.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com/shelfio/chrome-aws-lambda-layer","url":"https://github.com/shelfio/chrome-aws-lambda-layer"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gwww.youtube.com","url":"https://www.youtube.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/w3docs.com","url":"https://www.w3docs.com"}
]
Example output in S3
Source: https://stackoverflow.com/questions/63489068/iterate-over-multiple-payloads-and-take-multiple-screenshots-with-puppeteer-aws