Pagination when there is no "next page" button, just a bunch of "page number" buttons

别等时光非礼了梦想 · Submitted on 2021-01-29 09:24:18

Question


I was happily doing my scraping with R but found its limits. Trying to scrape the summaries of cases of Argentina's Supreme Court, I ran into a problem for which I cannot find an answer. It is likely the outcome of learning by doing, so please do point out where my code works but follows bad practice. Anyway, I managed to:

  1. Access the search page.
  2. Enter a relevant taxonomy term (e.g. 'DECRETO DE NECESIDAD Y URGENCIA') in #voces, click search, and scrape .datosSumarios, which holds the information I need (case name, date, reporter, and so on). The code is below:

const puppeteer = require('puppeteer');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');

  // wait until element ready  
    await Promise.all([
        page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
        page.waitForSelector('.ui-menu-item')
    ]);

    await page.click('.ui-menu-item');

    await Promise.all([
        page.click('.glyphicon-search'),
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
    ]);

    // Now we are where we want to be; capture what we need:
    
    const result = await page.evaluate(() => {

        let data = []; // Array that will store our results

        let elements = document.querySelectorAll('.row'); // Select all result rows

        for (let element of elements) { // Loop through each row

            // Query within the current row (not the whole document,
            // which would return the first summary on every iteration).
            let summary = element.querySelector('.datosSumario');
            if (summary) {
                data.push({ title: summary.innerText }); // Push an object with the data onto our array
            }

        }

        return data; // Return our data array
        
    });

    // review -> this attempt at paging did not work
    await page.click('#paginate_button2');

    await browser.close();
    return result;
};

scrape().then((value) => {
    console.log(value); // Success!
});

What I can't seem to do is go through the different pages. If you visit the page you'll see that the pagination is rather strange: there is no "next page" button, just a bunch of "page number" buttons. I can click them, but I cannot iterate the scraping section of the code above over them. I tried a loop but could not make it work, and I've reviewed a few pagination tutorials without finding one that deals with this particular kind of problem.

# Update

I was able to solve the pagination itself, but now I can't make the function that actually scrapes the text work inside the pagination loop (it works outside it, on a single page). Sharing in case someone can point out the obvious mistake I am probably making.

const puppeteer = require('puppeteer');
const fs = require('fs');

let scrape = async () => {
    const browser = await puppeteer.launch({headless: false});
    const page = await browser.newPage();

    await page.goto('https://sjconsulta.csjn.gov.ar/sjconsulta/');

  // wait until element ready  
    await Promise.all([
        page.type('#voces', 'DECRETO DE NECESIDAD Y URGENCIA'),
        page.waitForSelector('.ui-menu-item')
    ]);

    await page.click('.ui-menu-item');

    await Promise.all([
        page.click('.glyphicon-search'),
        page.waitForNavigation({ waitUntil: 'networkidle0' }),
    ]);

    var results = []; // variable to hold the "sumarios" I need
    var lastPageNumber = 2; // I am using 2 to test, but I can choose any number and it works (in this case, the 31 pages I need to scrap)
    for (let index = 0; index < lastPageNumber; index++) {
        // wait 5 s for the page to load
        await page.waitFor(5000);
        // Call MyFunction and concatenate its results on every iteration.
        // (results.push would give a collection of collections at the end.)
        results = results.concat(await MyFunction); // I call my function but it does not work, see below
        if (index != lastPageNumber - 1) {
            await page.click('li.paginate_button.active + li a[onclick]'); //This does the trick 
            await page.waitFor(5000);
        }
    }

    await browser.close();
    return results;

};

async function MyFunction() {

    // This bit works outside the async function environment,
    // where I get the text I need from a single page.
    const data = await page.evaluate(() =>

        Array.from(
            document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
            element => element.textContent)

    );

}

scrape().then((results) => {
    console.log(results); // Success!
    
});

Answer 1:


You can try document.querySelector('li.paginate_button.active + li a[onclick]') as the equivalent of a "next page" button. After clicking it, you can wait for a response whose URL starts with 'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex='.
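The suggestion above can be sketched as a small helper. This is only a sketch, not the poster's code: `isPaginateResponse` and `goToNextPage` are hypothetical names, and the endpoint URL is the one observed in the site's XHR traffic as quoted above.

```javascript
// Base of the paging XHR the site fires when a page-number button is clicked.
const PAGINATE_URL =
    'https://sjconsulta.csjn.gov.ar/sjconsulta/consultaSumarios/paginarSumarios.html?startIndex=';

// Pure helper: does a response URL belong to the paging endpoint?
function isPaginateResponse(url) {
    return url.startsWith(PAGINATE_URL);
}

// Click the page button right after the active one, and wait for the
// paging response to arrive before touching the DOM again.
async function goToNextPage(page) {
    await Promise.all([
        page.waitForResponse(res => isPaginateResponse(res.url())),
        page.click('li.paginate_button.active + li a[onclick]'),
    ]);
}
```

Waiting on the response rather than a fixed timeout avoids scraping the old page's DOM when the server is slow.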

# For update

At first glance, there are some issues:

  1. MyFunction is not called: you need await MyFunction() instead of await MyFunction.

  2. You need to pass page into the MyFunction() scope:

  results = results.concat(await MyFunction(page));
//...
async function MyFunction(page) {
// ...
}
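Putting both fixes together, a corrected helper might look like the sketch below. It also includes a third fix the list above does not mention: the original MyFunction never returns data, so even with page in scope the caller would concatenate undefined.

```javascript
// Takes the Puppeteer page explicitly and returns the scraped strings.
async function MyFunction(page) {
    const data = await page.evaluate(() =>
        Array.from(
            document.querySelectorAll('div[class="col-sm-8 col-lg-9 datosSumario"]'),
            element => element.textContent
        )
    );
    return data; // without this return, the caller gets undefined
}
```

Called as `results = results.concat(await MyFunction(page));` inside the loop.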


Source: https://stackoverflow.com/questions/62492290/pagination-when-there-is-no-next-page-button-but-bunch-of-page-numbers-pages
