Question
According to https://github.com/GoogleChrome/puppeteer/issues/628, I should be able to get all links from <a href="xyz"> elements with this single line:
const hrefs = await page.$$eval('a', a => a.href);
But when I try a simple:
console.log(hrefs)
I only get:
http://example.de/index.html
... as output, which suggests that it found only one link. But the page definitely has 12 links in its source code / DOM. Why does it fail to find them all?
Minimal example:
'use strict';
const puppeteer = require('puppeteer');
crawlPage();
function crawlPage() {
  (async () => {
    const args = [
      "--disable-setuid-sandbox",
      "--no-sandbox",
      "--blink-settings=imagesEnabled=false",
    ];
    const options = {
      args,
      headless: true,
      ignoreHTTPSErrors: true,
    };
    const browser = await puppeteer.launch(options);
    const page = await browser.newPage();
    await page.goto("http://example.de", {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    const hrefs = await page.$eval('a', a => a.href);
    console.log(hrefs);
    await page.close();
    await browser.close();
  })().catch((error) => {
    console.error(error);
  });
}
Answer 1:
In your example code you're using page.$eval, not page.$$eval. Since the former uses document.querySelector instead of document.querySelectorAll, the behaviour you describe is the expected one.
Also, you should change your pageFunction in the $$eval arguments so that it maps over the array of elements it receives:
const hrefs = await page.$$eval('a', as => as.map(a => a.href));
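To see why the single-dollar call returns just one URL, here is a minimal sketch that mimics the two lookups without Puppeteer. The fakeDocument object, the evalFirst and evalAll helpers, and the extra URLs are hypothetical stand-ins invented for illustration; they only imitate the first-match versus all-matches behaviour described above.

```javascript
// Stand-in for a page's DOM. This is NOT Puppeteer; it only imitates
// document.querySelector / document.querySelectorAll for <a> elements.
const fakeDocument = {
  anchors: [
    { href: 'http://example.de/index.html' },
    { href: 'http://example.de/about.html' },   // hypothetical URL
    { href: 'http://example.de/contact.html' }, // hypothetical URL
  ],
  querySelector(selector) {
    return selector === 'a' ? (this.anchors[0] || null) : null;
  },
  querySelectorAll(selector) {
    return selector === 'a' ? this.anchors : [];
  },
};

// Imitates page.$eval: the page function receives the FIRST match only.
function evalFirst(doc, selector, pageFunction) {
  const element = doc.querySelector(selector);
  if (element === null) {
    throw new Error(`no element matches "${selector}"`);
  }
  return pageFunction(element);
}

// Imitates page.$$eval: the page function receives an ARRAY of all matches.
function evalAll(doc, selector, pageFunction) {
  return pageFunction(Array.from(doc.querySelectorAll(selector)));
}

const single = evalFirst(fakeDocument, 'a', a => a.href);
const all = evalAll(fakeDocument, 'a', as => as.map(a => a.href));

console.log(single);     // one URL string
console.log(all.length); // number of links found
```

With the real API the same distinction applies: one dollar sign gives you the first matching element, two give you every match.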
Answer 2:
The page.$$eval() method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to the page function.
Since a in your example represents an array, you will either need to specify which element of the array you want to obtain the href from, or you will need to map all of the href attributes to an array.
page.$$eval()
const hrefs = await page.$$eval('a', links => links.map(a => a.href));
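The difference between the one-liner from the question and the mapped version can be checked in plain Node, since the page function is just a function applied to an array. This is only a sketch: the anchors array and its URLs are hypothetical stand-ins for the elements that $$eval would pass in.

```javascript
// Hypothetical stand-ins for the array of <a> elements passed to the page function.
const anchors = [
  { href: 'http://example.de/index.html' },
  { href: 'http://example.de/imprint.html' }, // hypothetical URL
];

// Page function from the question: treats the whole array as one element.
const fromQuestion = a => a.href;
// Option 1: pick a specific element of the array.
const firstOnly = links => links[0].href;
// Option 2: map every element to its href.
const mapped = links => links.map(a => a.href);

const wrong = fromQuestion(anchors); // arrays have no href property
const first = firstOnly(anchors);
const right = mapped(anchors);

console.log(wrong); // undefined
console.log(first);
console.log(right);
```

So even after switching to $$eval, the page function a => a.href would not return the links; it has to index into or map over the array.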
Alternatively, you can use page.evaluate(), or a combination of page.$$(), elementHandle.getProperty(), and jsHandle.jsonValue(), to obtain an array of all links from the page.
page.evaluate()
const hrefs = await page.evaluate(() => {
  return Array.from(document.getElementsByTagName('a'), a => a.href);
});
page.$$() / elementHandle.getProperty() / jsHandle.jsonValue()
const hrefs = await Promise.all((await page.$$('a')).map(async a => {
  return await (await a.getProperty('href')).jsonValue();
}));
Source: https://stackoverflow.com/questions/49492017/how-to-get-all-links-from-the-dom