Question
I am currently using the following Puppeteer AWS Lambda layer to scrape 30 URLs, creating and saving screenshots in S3. At the moment I send 30 individual payloads, thereby invoking 30 separate AWS Lambda functions: https://github.com/shelfio/chrome-aws-lambda-layer
Each JSON payload contains a URL and an image file name, and the payloads are sent every 2-3 seconds to API Gateway via POST requests. The first 6 or 9 Lambda functions in the list seem to run fine, then they start to fail with Navigation failed because browser has disconnected!, as reported in AWS CloudWatch.
So I am looking for an alternative solution: how could I edit the code below to batch-screenshot a set of 30 URLs by handling a single array of JSON payloads (e.g. with a for loop)?
Here is my current code for generating individual AWS Lambda screenshots and sending to S3:
// src/capture.js
// this module will be provided by the layer
const chromeLambda = require("chrome-aws-lambda");
// aws-sdk is always preinstalled in AWS Lambda in all Node.js runtimes
const S3Client = require("aws-sdk/clients/s3");
process.setMaxListeners(0) // <== Important line - fixes MaxListeners error
// create an S3 client
const s3 = new S3Client({ region: process.env.S3_REGION });
// default browser viewport size
const defaultViewport = {
width: 1920,
height: 1080
};
// here starts our function!
exports.handler = async event => {
// launch a headless browser
const browser = await chromeLambda.puppeteer.launch({
args: chromeLambda.args,
executablePath: await chromeLambda.executablePath,
defaultViewport
});
console.log("Event URL string is ", event.url)
const url = event.url;
const domain = (new URL(url)).hostname.replace('www.', '');
// open a new tab
const page = await browser.newPage();
// navigate to the page
await page.goto(event.url);
// take a screenshot
const buffer = await page.screenshot()
// upload the image using the current timestamp as filename
const result = await s3
.upload({
Bucket: process.env.S3_BUCKET,
Key: domain + `.png`,
Body: buffer,
ContentType: "image/png",
ACL: "public-read"
})
.promise();
// return the uploaded image url
return { url: result.Location };
};
Current Individual JSON Payload
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gavurin.com.png","url":"https://gavurin.com"}
Answer 1:
I tried to replicate the issue and modified the code to use a loop.
While working on this issue, I found several things worth pointing out:
- The Lambda requires a lot of RAM (at least 1 GB in my tests, but more is better). Using a small amount of RAM led to failures.
- The Lambda timeout must be large enough to handle the number of URLs being screenshotted.
- The img value from the JSON payload is not used at all. I did not modify this behavior, as I don't know whether this is by design or not.
- Errors similar to yours were observed when running an async for loop and/or when not closing the pages that were opened.
- I modified the return value to output an array of S3 URLs.
- URL was undefined, so the modified code requires it from the url module.
Modified code
Here is the modified code that worked in my tests using the nodejs12.x runtime:
// src/capture.js
var URL = require('url').URL;
// this module will be provided by the layer
const chromeLambda = require("chrome-aws-lambda");
// aws-sdk is always preinstalled in AWS Lambda in all Node.js runtimes
const S3Client = require("aws-sdk/clients/s3");
process.setMaxListeners(0) // <== Important line - fixes MaxListeners error
// create an S3 client
const s3 = new S3Client({ region: process.env.S3_REGION });
// default browser viewport size
const defaultViewport = {
width: 1920,
height: 1080
};
// here starts our function!
exports.handler = async event => {
// launch a headless browser
const browser = await chromeLambda.puppeteer.launch({
args: chromeLambda.args,
executablePath: await chromeLambda.executablePath,
defaultViewport
});
const s3_urls = [];
for (const e of event) {
console.log(e);
console.log("Event URL string is ", e.url)
const url = e.url;
const domain = (new URL(url)).hostname.replace('www.', '');
// open a new tab
const page = await browser.newPage();
// navigate to the page
await page.goto(e.url);
// take a screenshot
const buffer = await page.screenshot()
// upload the image using the current timestamp as filename
const result = await s3
.upload({
Bucket: process.env.S3_BUCKET,
Key: domain + `.png`,
Body: buffer,
ContentType: "image/png",
ACL: "public-read"
})
.promise();
await page.close();
s3_urls.push({ url: result.Location });
}
await browser.close();
// return the uploaded image url
return s3_urls;
};
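The loop above takes the screenshots strictly one at a time, which is safe but slow for 30 URLs. A middle ground between that and an unbounded async loop (which, as noted in the list above, caused browser-disconnect errors) is to cap how many pages run concurrently. The mapWithConcurrency helper below is my own sketch, not part of the answer; it is plain JavaScript with no extra dependencies:

```javascript
// mapWithConcurrency is a hypothetical helper, not part of the answer's code:
// it runs an async worker over items with at most `limit` workers active at
// once, and resolves to the results in the original item order.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0; // shared index the runners pull from
  async function run() {
    while (next < items.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await worker(items[i], i);
    }
  }
  // start at most `limit` runners and wait for all of them to drain the list
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}
```

The body of the for loop (newPage, goto, screenshot, upload, close) would become the worker passed as mapWithConcurrency(event, 3, worker); the limit of 2-3 concurrent pages is an assumption to tune against the Lambda's memory setting, not a tested value.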
Example payload
[
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gavurin.com.png","url":"https://gavurin.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/google.com.png","url":"https://google.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/amazon.com","url":"https://www.amazon.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/stackoverflow.com","url":"https://stackoverflow.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/duckduckgo.com","url":"https://duckduckgo.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/docs.aws.amazon.com","url":"https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-features.html"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com","url":"https://github.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/github.com/shelfio/chrome-aws-lambda-layer","url":"https://github.com/shelfio/chrome-aws-lambda-layer"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/gwww.youtube.com","url":"https://www.youtube.com"},
{"img":"https://s3screenshotbucket-useast1v5.s3.amazonaws.com/w3docs.com","url":"https://www.w3docs.com"}
]
Example output in S3
Source: https://stackoverflow.com/questions/63489068/iterate-over-multiple-payloads-and-take-multiple-screenshots-with-puppeteer-aws