Node JS Puppeteer infinite scroll loop

旧城冷巷雨未停 posted on 2019-12-08 12:00:20

Question


I am learning Puppeteer and trying to scrape a website that implements infinite scroll. I can get all the prices from the list by scrolling down after a delay of 1 second. The URL is the one passed to page.goto in the code below.

What I want to do is open an item from the list, get the product name, go back to the list, select the second product, and repeat this for all products.

const fs = require('fs');
const puppeteer = require('puppeteer');
function extractItems() {
  const extractedElements = document.querySelectorAll('.price');
  const items = [];
  for (let element of extractedElements) {
    items.push(element.innerText);
  }
  return items;
}
async function scrapeInfiniteScrollItems(
  page,
  extractItems,
  itemTargetCount,
  scrollDelay = 1000,
) {
  let items = [];
  try {
    let previousHeight;
    while (items.length < itemTargetCount) {
      items = await page.evaluate(extractItems);
      previousHeight = await page.evaluate('document.body.scrollHeight');
      await page.evaluate('window.scrollTo(0, document.body.scrollHeight)');
      await page.waitForFunction(`document.body.scrollHeight > ${previousHeight}`);
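      // page.waitFor was deprecated and later removed from Puppeteer; a plain timer works in any version.
      await new Promise((resolve) => setTimeout(resolve, scrollDelay));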
    }
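  } catch (e) {
    // Usually a waitForFunction timeout once the page stops growing; log it instead of hiding it.
    console.error(e);
  }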
  return items;
}
(async () => {
  // Set up browser and page.
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 926 });
  // Navigate to the demo page.
  await page.goto('https://www.clubfactory.com/views/product.html?categoryId=53&subId=53&filter=%7B%22Price%22%3A%5B%7B%22beg%22%3A1.32%2C%22end%22%3A0%7D%5D%7D');
  // Scroll and extract items from the page.
  const items = await scrapeInfiniteScrollItems(page, extractItems, 4000);
  // Save extracted items to a file.
  fs.writeFileSync('./prices3.txt', items.join('\n') + '\n');
  // Close the browser.
  await browser.close();
})(); 

Any help is appreciated


Answer 1:


EDIT: I added a working snippet for the particular website listed in the question.

When scraping, you sometimes have to break the user experience down into small steps and mimic a real user in order to get the same data that a real user would see.

One easy way to deal with infinite scrolling is to remove all the current elements, then scroll until another 10 or 100 new elements have loaded, and repeat. You can even try to scrape everything at once.
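
The batch idea above could be sketched roughly as follows. This is only an illustration, not code from the original answer: scrapeInBatches, its parameters and the timeout value are hypothetical, and selector is assumed to match the whole item container so that removing it really shrinks the list. Whether scrolling to the bottom triggers loading depends on the site.

```javascript
// Sketch: read the current batch, remove it, scroll, wait for the next batch.
// `page` is a Puppeteer Page; `selector` must match each list item (assumption).
async function scrapeInBatches(page, selector, targetCount, timeoutMs = 10000) {
  const collected = [];
  while (collected.length < targetCount) {
    // Read the text of every item currently in the DOM and remove it right away.
    const batch = await page.$$eval(selector, (elements) =>
      elements.map((el) => {
        const text = el.innerText;
        el.remove();
        return text;
      })
    );
    collected.push(...batch);

    // Scroll so the site's infinite-scroll handler has a chance to load more.
    await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));

    try {
      // Wait until at least one new item shows up.
      await page.waitForFunction(
        (sel) => document.querySelectorAll(sel).length > 0,
        { timeout: timeoutMs },
        selector
      );
    } catch (err) {
      break; // nothing new arrived in time, so assume the list is exhausted
    }
  }
  return collected.slice(0, targetCount);
}
```

It would be called as scrapeInBatches(page, someItemSelector, 100) after page.goto has finished.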

But you can also approach it another way:

  1. get the first element,
  2. click to open in new tab,
  3. parse the data,
  4. close tab,
  5. remove the element,
  6. and move on to the next element. Scroll and wait until a new element loads.
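
The six steps above might look like this when driven from Node. Treat it as a sketch under assumptions: scrapeOneByOne and its options are hypothetical names, each list item is assumed to contain an a[href] link to its detail page, and the tab is opened with goto on that link rather than by an actual click.

```javascript
// Sketch of steps 1-6. `browser` and `listPage` are the usual Puppeteer objects.
// Assumption: every item holds an <a href>, and nameSelector matches the product
// name on the detail page.
async function scrapeOneByOne(browser, listPage, options) {
  const { itemSelector, nameSelector, maxItems = 20, waitMs = 1000 } = options;
  const names = [];

  for (let i = 0; i < maxItems; i++) {
    // Step 1: find the first remaining item and read its link.
    const href = await listPage.evaluate((sel) => {
      const item = document.querySelector(sel);
      if (!item) return null; // nothing left in the list
      item.scrollIntoView();
      const link = item.querySelector('a[href]');
      return link ? link.href : '';
    }, itemSelector);
    if (href === null) break;

    if (href) {
      // Steps 2-4: open the item in a new tab, parse it, close the tab.
      const tab = await browser.newPage();
      try {
        await tab.goto(href, { waitUntil: 'domcontentloaded' });
        await tab.waitForSelector(nameSelector);
        names.push(await tab.$eval(nameSelector, (el) => el.innerText.trim()));
      } finally {
        await tab.close();
      }
    }

    // Step 5: remove the processed item from the list page.
    await listPage.evaluate((sel) => {
      const item = document.querySelector(sel);
      if (item) item.remove();
    }, itemSelector);

    // Step 6: give the page a moment to load the next item.
    await new Promise((resolve) => setTimeout(resolve, waitMs));
  }
  return names;
}
```

Because the list page stays open the whole time, there is no need to navigate back and lose the scroll position.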

The problem with this concept is that you cannot know in advance how a site triggers its scroll and click handling. Different sites bind different events to scrolling, and the site in the question is built with Vue.js.

Code Snippet

The selector for each product is #__layout > section > main > section > section > div.products > div > div.

We will scroll to the element matched by the selector, process it, then remove it. We also trigger a scroll event so the page knows something has changed.

window.scrollTo(0, 0);
const selector = `#__layout > section > main > section > section > div.products > div > div`;
const element = document.querySelector(selector);
element.scrollIntoView();
element.remove();

Result: (gif animation)

What's cool is that we do not need to scroll to the bottom of the page to trigger the change. Notice how the scrollbar changes during the removal.

This works on sites like Product Hunt as well.

const delay = d=>new Promise(r=>setTimeout(r,d))

const scrollAndRemove = async () => {
    // scroll to top to trigger the scroll events
    window.scrollTo(0, 0);
    const selector = `.title_9ddaf`;
    const element = document.querySelector(selector);

    // stop if there are no elements left
    if(element){
      element.scrollIntoView();

      // do my action
      // wait for a moment to reduce load or lazy loading image
      await delay(1000);
      console.log(element.innerText);
      // end of my action

      // remove the element to trigger some scroll event somewhere
      element.remove();

      // return another promise
      return scrollAndRemove()
    }
}

scrollAndRemove();
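
The snippet above runs in the browser and only logs to the page console. One possible way to get the text back into the Node script is to run the same loop inside page.evaluate and return an array. This is a sketch: collectByRemoving and its options are hypothetical names, and like the original it stops as soon as no matching element is left.

```javascript
// Sketch: same scroll-and-remove loop, but the collected text is returned to Node.
// `page` is a Puppeteer Page; selector, delayMs and maxItems are caller-supplied.
async function collectByRemoving(page, selector, { delayMs = 1000, maxItems = 100 } = {}) {
  return page.evaluate(async (sel, delayMs, maxItems) => {
    const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
    const texts = [];
    while (texts.length < maxItems) {
      // Scroll to the top to trigger the scroll events.
      window.scrollTo(0, 0);
      const element = document.querySelector(sel);
      if (!element) break; // stop if there are no elements left
      element.scrollIntoView();
      // Wait a moment to reduce load and let lazy content appear.
      await delay(delayMs);
      texts.push(element.innerText);
      // Remove the element so the next one becomes the first match.
      element.remove();
    }
    return texts;
  }, selector, delayMs, maxItems);
}
```

The returned array can then be written to a file with fs.writeFileSync, as in the question's script.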



Source: https://stackoverflow.com/questions/52660676/node-js-puppteer-infinite-scroll-loop
