Question
I am learning Puppeteer and trying to scrape a website that implements infinite scroll. I can get all the prices from the list by scrolling down after a delay of 1 second. The URL is in the page.goto call in the code below.
What I want to do is open an item from the list, get the product name, go back to the list, select the second product, and repeat this for all products.
const fs = require('fs');
const puppeteer = require('puppeteer');

// Runs in the browser: collect the text of every price element currently in the DOM.
function extractItems() {
  const extractedElements = document.querySelectorAll('.price');
  const items = [];
  for (const element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}

async function scrapeInfiniteScrollItems(
  page,
  extractItems,
  itemTargetCount,
  scrollDelay = 1000,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemTargetCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      // Throws on timeout once the page stops growing, which ends the loop.
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
      // Plain timer instead of page.waitFor, which newer Puppeteer versions no longer provide.
      await new Promise(resolve => setTimeout(resolve, scrollDelay));
    }
  } catch (e) {
    // Ignored on purpose: return whatever was collected so far.
  }
  return items;
}

(async () => {
  // Set up browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });
  // Navigate to the demo page.
  await page.goto('https://www.clubfactory.com/views/product.html?categoryId=53&subId=53&filter=%7B%22Price%22%3A%5B%7B%22beg%22%3A1.32%2C%22end%22%3A0%7D%5D%7D');
  // Scroll and extract items from the page.
  const items = await scrapeInfiniteScrollItems(page, extractItems, 4000);
  // Save extracted items to a file.
  fs.writeFileSync('./prices3.txt', items.join('\n') + '\n');
  // Close the browser.
  await browser.close();
})();
Any help is appreciated.
Answer 1:
EDIT: I added a working snippet for the particular website listed in the question.
When scraping, you sometimes have to break the user experience down into small steps and mimic a real user to get the data that user would actually see.
One easy way to deal with infinite scrolling is to remove all the current elements and scroll until another 10 or 100 new elements appear each time, or even to try scraping everything at once.
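A minimal sketch of that batch idea, driven from Node. It assumes `page` is a Puppeteer Page already on the list and that `selector` matches each whole list item; the names `takeBatch` and `scrapeInBatches` are made up for illustration, and whether removing nodes makes a given site load more is something to verify per site.

```javascript
// Runs in the browser: read every matching item, remove it, then scroll down
// so the page has a reason to load the next batch.
function takeBatch(selector) {
  const nodes = Array.from(document.querySelectorAll(selector));
  const texts = nodes.map(node => node.innerText);
  nodes.forEach(node => node.remove());
  window.scrollTo(0, document.body.scrollHeight);
  return texts;
}

// Runs in Node: keep taking batches until the target is reached or
// no new items show up within `timeout` milliseconds.
async function scrapeInBatches(page, selector, targetCount, timeout = 5000) {
  const items = [];
  while (items.length < targetCount) {
    // waitForSelector rejects on timeout; treat that as "list exhausted".
    const arrived = await page.waitForSelector(selector, { timeout }).catch(() => null);
    if (!arrived) break;
    items.push(...(await page.evaluate(takeBatch, selector)));
  }
  return items.slice(0, targetCount);
}
```

Because each batch is removed after it is read, the DOM stays small and no item is counted twice.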
But you can also approach it another way:
- get the first element,
- click to open in new tab,
- parse the data,
- close tab,
- remove the element,
- and move on to the next element, scrolling and waiting until a new element arrives.
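The steps above could look roughly like this in Puppeteer. This is a sketch under assumptions, not code tested against the site: `browser` and `listPage` are expected to come from `puppeteer.launch()` and `browser.newPage()`, the selectors `ITEM_LINK` and `NAME` are hypothetical placeholders, and it assumes each item is a plain link with an `href` (a site that navigates through click handlers would need a different way to open the detail page).

```javascript
// Hypothetical selectors: replace with the real ones for the target site.
const ITEM_LINK = '.products a';
const NAME = 'h1';

// Runs in the browser: take the first item's URL, then remove the item.
function takeFirstLink(selector) {
  window.scrollTo(0, 0);
  const link = document.querySelector(selector);
  if (!link) return null;
  link.scrollIntoView();
  const href = link.href;
  link.remove();
  return href;
}

// Runs in Node: open each item in its own tab, read the name, close the tab.
async function scrapeNames(browser, listPage, maxItems = 20) {
  const names = [];
  while (names.length < maxItems) {
    // Wait for the next item; a timeout is treated as the end of the list.
    const found = await listPage.waitForSelector(ITEM_LINK, { timeout: 5000 }).catch(() => null);
    if (!found) break;
    const href = await listPage.evaluate(takeFirstLink, ITEM_LINK);
    if (!href) break;
    const tab = await browser.newPage();
    try {
      await tab.goto(href, { waitUntil: 'domcontentloaded' });
      names.push(await tab.$eval(NAME, el => el.innerText.trim()));
    } finally {
      await tab.close(); // always close the tab, even if parsing failed
    }
  }
  return names;
}
```

Opening a separate tab keeps the list page and its scroll position untouched, so there is no need to navigate back and scroll again.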
The catch with this approach is that you cannot know in advance how a site triggers its scroll and click handling. Different sites bind different events to scrolling, and the site in question is built with Vue.js.
Code Snippet
The selector for each product is #__layout > section > main > section > section > div.products > div > div.
We will scroll the matched element into view, process it, then remove it. Before that we scroll to the top, which fires a scroll event so the page notices that something has changed.
window.scrollTo(0, 0);
const selector = `#__layout > section > main > section > section > div.products > div > div`;
const element = document.querySelector(selector)
element.scrollIntoView()
element.remove()
Result: (gif animation)
What's nice is that we do not need to scroll to the bottom of the page to trigger the change. Watch how the scrollbar changes as elements are removed.
This works on sites like Product Hunt as well.
const delay = d => new Promise(r => setTimeout(r, d));

const scrollAndRemove = async () => {
  // scroll to top to trigger the scroll events
  window.scrollTo(0, 0);
  const selector = `.title_9ddaf`;
  const element = document.querySelector(selector);
  // stop if there are no elements left
  if (element) {
    element.scrollIntoView();
    // do my action
    // wait for a moment to reduce load or lazy loading image
    await delay(1000);
    console.log(element.innerText);
    // end of my action
    // remove the element to trigger some scroll event somewhere
    element.remove();
    // return another promise
    return scrollAndRemove();
  }
};

scrollAndRemove();
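The snippet above runs in the browser console and only logs each title. To get the values back into a Node script, one option is to run a single scroll/read/remove step per `page.evaluate` call and loop on the Node side. This is a sketch: `takeNextText` and `collectTexts` are illustrative names, and `page` is assumed to be a Puppeteer Page already on the list.

```javascript
// Runs in the browser: one scroll/read/remove step that returns the text,
// or null when no matching element is left.
async function takeNextText(selector, pauseMs) {
  window.scrollTo(0, 0);
  const element = document.querySelector(selector);
  if (!element) return null;
  element.scrollIntoView();
  // Short pause, as in the snippet above, to go easy on lazy loading.
  await new Promise(resolve => setTimeout(resolve, pauseMs));
  const text = element.innerText;
  element.remove();
  return text;
}

// Runs in Node: repeat the step until the list is empty or maxItems is reached.
async function collectTexts(page, selector, maxItems = 100, pauseMs = 1000) {
  const texts = [];
  while (texts.length < maxItems) {
    const text = await page.evaluate(takeNextText, selector, pauseMs);
    if (text === null) break;
    texts.push(text);
  }
  return texts;
}
```

A call such as `await collectTexts(page, '.title_9ddaf', 50)` would then return up to 50 titles. The loop with a `maxItems` cap replaces the open-ended recursion, so a truly endless feed cannot keep the script running forever.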
Source: https://stackoverflow.com/questions/52660676/node-js-puppteer-infinite-scroll-loop